diff --git a/.agents/skills/code-reviewer/SKILL.md b/.agents/skills/code-reviewer/SKILL.md
new file mode 100644
index 000000000..5d0fae46c
--- /dev/null
+++ b/.agents/skills/code-reviewer/SKILL.md
@@ -0,0 +1,229 @@
+---
+name: code-reviewer
+description: PR review checklist for semba-fdtd: CI verification across Ubuntu/Windows matrix, Fortran code review patterns, conditional compilation guards, test coverage expectations, MPI/OpenMP safety checks, and AI-assisted contribution policies
+---
+
+## When to use
+
+- Reviewing a pull request before merging
+- Evaluating code quality and correctness of Fortran/Python changes
+- Checking that tests cover new functionality
+- Verifying conditional compilation guards are correct
+- Ensuring documentation is updated for user-facing changes
+- Reviewing AI-assisted contributions for correctness
+
+## Key Files to Read
+
+### Project Standards
+- `CONTRIBUTING.md` — Contribution guidelines, review requirements
+- `AGENTS.md` — AI coding agent guidance (this file)
+- `README.md` — Project overview and quality expectations
+
+### CI/CD Pipelines
+- `.github/workflows/ubuntu.yml` — Ubuntu CI: GCC + Intel, MPI ON/OFF, MTLN ON/OFF
+- `.github/workflows/windows.yml` — Windows CI: IntelLLVM + Ninja
+- `.github/workflows/check-submodules-default-branch.yml` — Submodule integrity checks
+
+### Code Style References
+- `CMakeLists.txt` — Compiler flags, build configuration
+- `doc/fdtdjson.md` — Input format spec (check for input format changes)
+- `doc/development.md` — Development setup and conventions
+
+## What CI Checks Must Pass
+
+Every PR must pass **all** of the following before merging:
+
+### Ubuntu CI (`ubuntu.yml`)
+14 build configurations in matrix:
+- Compilers: GCC 11 + Intel 2025.1
+- MPI: ON and OFF
+- MTLN: ON and OFF
+- HDF: ON (always)
+- Double precision: OFF (default) + 1 Intel config with ON
+
+Each configuration must:
+- Build successfully
+- Pass `build/bin/fdtd_tests` (unit tests)
+- Pass `pytest test/ --durations=20` (integration tests)
+
+### Windows CI (`windows.yml`)
+6 build configurations:
+- Compiler: IntelLLVM 2025.2.1 + Ninja
+- MTLN: ON and OFF
+- HDF: ON
+- Double precision: ON and OFF
+
+Each configuration must:
+- Build successfully
+- Pass `build/bin/fdtd_tests.exe`
+- Pass `pytest -m 'not codemodel' test/ --durations=20` (codemodel excluded on Windows)
+
+### Submodule Check
+- All submodule commits must be ancestors of their `main`/`master` branch history
+
+## Review Checklist
+
+### 1. General PR Quality
+
+- [ ] PR description includes a short summary of the change
+- [ ] PR description includes motivation/context (what problem it solves)
+- [ ] PR description explains how it was tested (commands run, platforms used)
+- [ ] PR is opened against `dev` (or `main` per current workflow)
+- [ ] Commit messages are clear, descriptive, and in English
+- [ ] Commit messages reference issue numbers where applicable (e.g., `#392`)
+- [ ] Changes are focused and small (not mixing unrelated changes)
+- [ ] Small fixup commits are squashed
+
+### 2. Fortran Code Review
+
+- [ ] **Code style matches surrounding code** — no new indentation, naming, or structural conventions introduced
+- [ ] **Conditional compilation guards are correct** — check that `#ifdef` guards match the CMake flags:
+  - `SEMBA_FDTD_ENABLE_MPI` — MPI support
+  - `SEMBA_FDTD_ENABLE_HDF` — HDF5 output
+  - `SEMBA_FDTD_ENABLE_MTLN` — MTLN solver + ngspice
+  - `SEMBA_FDTD_ENABLE_SMBJSON` — JSON input parser
+  - `SEMBA_FDTD_ENABLE_DOUBLE_PRECISION` — 8-byte reals
+- [ ] **Module dependencies are respected** — code doesn't depend on modules below it in the library chain:
+  ```
+  semba-types -> semba-reports -> smbjson/conformal/semba-components
+    -> mtlnsolver/semba-outputs -> semba-main -> semba-fdtd
+  ```
+- [ ] **No new unused variables or dead code**
+- [ ] **Array bounds are safe** — FDTD uses extensive pointer aliasing; verify no out-of-bounds access
+- [ ] **Memory management is correct** — no leaks in derived type allocation/deallocation
+- [ ] **OpenMP directives are correct** — if adding parallel loops, verify `COLLAPSE`, `DEFAULT(SHARED)`, and private variables
+- [ ] **MPI communication is safe** — if modifying MPI code, verify non-blocking patterns and buffer sizes
+
+### 3. Build System Review
+
+- [ ] **CMakeLists.txt changes are minimal** — only add new targets, don't restructure existing ones
+- [ ] **New optional features have CMake options** — follow the pattern of existing flags
+- [ ] **Compiler flags are correct** — match existing patterns for the target compiler
+- [ ] **Windows compatibility** — if changes affect Windows, verify IntelLLVM + Ninja builds
+- [ ] **Submodule changes are intentional** — verify submodule commits are on accepted branch history
+
+### 4. Test Coverage
+
+- [ ] **New functionality has corresponding tests** — both unit tests and integration tests where applicable
+- [ ] **Bug fixes have regression tests** — a test that would have caught the original bug
+- [ ] **Test data is appropriate** — minimal but sufficient to exercise the code path
+- [ ] **Tests use correct pytest markers** — `mtln`, `hdf`, `mpi`, `codemodel`
+- [ ] **Tests are isolated** — Python tests use `tmp_path`, don't write to repo directory
+- [ ] **Reference outputs are updated** — if expected outputs change, `testData/outputs/` is updated
+- [ ] **All existing tests still pass** — no regressions
+
+### 5. Documentation
+
+- [ ] **`doc/` is updated** for user-facing behavior changes
+- [ ] **`.fdtd.json` changes reference `doc/fdtdjson.md`** — verify input format is documented
+- [ ] **New CMake options are documented** in `AGENTS.md` and `CLAUDE.md`
+- [ ] **Tutorial examples are updated** if the workflow changes
+
+### 6. AI-Assisted Contributions
+
+If the PR has significant AI-generated content:
+- [ ] PR description clearly states AI assistance was used
+- [ ] Extent of AI assistance is described
+- [ ] **At least two human reviewers** have approved (per CONTRIBUTING.md)
+- [ ] Human author has taken responsibility for correctness
+- [ ] Code quality is indistinguishable from human-written code in this repo
+
+## Common Issues to Watch For
+
+### Conditional Compilation Gotchas
+
+    ! BAD: Missing guard for optional feature
+    call some_mtl_function()   ! Only available with SEMBA_FDTD_ENABLE_MTLN
+
+    ! GOOD: Properly guarded
+    #ifdef SEMBA_FDTD_ENABLE_MTLN
+    call some_mtl_function()
+    #endif
+
+### MPI Safety
+
+- [ ] Non-blocking operations (`MPI_ISEND`/`MPI_IRECV`) are followed by `MPI_WAITALL`
+- [ ] Buffer sizes account for ghost cell exchange correctly
+- [ ] Wire MPI communication is updated if wire models are modified
+- [ ] `MPI_Barrier` usage is appropriate (not causing deadlocks)
+
+### OpenMP Safety
+
+- [ ] `COLLAPSE(n)` dimensions are valid for the loop nest
+- [ ] Private/shared variable clauses are correct
+- [ ] No data races on shared arrays
+- [ ] Reduction clauses used where accumulators are involved
+
+### Memory Safety
+
+- [ ] No pointer aliasing that could cause false dependencies in parallel regions
+- [ ] Derived types are properly initialized before use
+- [ ] Arrays are allocated before being accessed
+- [ ] `contiguous` attribute used on field pointers for cache-friendly access
+
+### Input Format Changes
+
+If `.fdtd.json` format is modified:
+- [ ] `doc/fdtdjson.md` is updated with the new format
+- [ ] Backward compatibility is maintained if possible
+- [ ] Migration guide is provided for existing input files
+- [ ] Parser tests (`test/smbjson/`) are updated
+
+## Review Process
+
+### Step-by-Step
+
+1. **Read the PR description** — understand the intent and context
+2. **Check CI status** — verify all CI jobs pass (or are expected to fail)
+3. **Review the diff** — line by line for logic errors, style issues
+4. **Verify test coverage** — ensure new code has tests
+5. **Check conditional compilation** — verify all `#ifdef` guards are correct
+6. **Verify documentation** — ensure user-facing changes are documented
+7. **Run local tests** (if needed) — build and run tests for the specific change
+8. **Leave constructive feedback** — be specific, suggest fixes, explain why
+
+### Feedback Style
+
+- Be specific about what needs changing and why
+- Suggest concrete fixes when possible
+- Distinguish between blocking issues (must fix) and suggestions (nice to have)
+- Acknowledge good changes and improvements
+- Keep discussion technical and respectful
+
+## Quick Reference: What to Check by Change Type
+
+### Adding a new feature
+- New code follows existing conventions
+- New CMake option (if optional feature)
+- New tests (unit + integration)
+- Documentation updated
+- Test data added
+
+### Fixing a bug
+- Root cause identified and explained
+- Regression test added
+- Minimal changes (no over-engineering)
+- All existing tests still pass
+
+### Modifying optional feature
+- `#ifdef` guards are correct for all build configurations
+- Tested with feature ON and OFF
+- CI matrix includes the relevant configurations
+
+### Modifying MPI code
+- Non-blocking I/O pattern preserved
+- Buffer sizes correct for all domain sizes
+- Wire MPI communication updated if needed
+- No deadlocks or race conditions
+
+### Modifying OpenMP code
+- Parallel correctness verified
+- Performance impact assessed
+- Thread scaling tested
+- No false sharing or cache contention
+
+### Modifying input format
+- `doc/fdtdjson.md` updated
+- Parser tests updated
+- Backward compatibility maintained
+- Migration guide provided
diff --git a/.agents/skills/emc-engineer/SKILL.md b/.agents/skills/emc-engineer/SKILL.md
new file mode 100644
index 000000000..34688645f
--- /dev/null
+++ b/.agents/skills/emc-engineer/SKILL.md
@@ -0,0 +1,285 @@
+---
+name: emc-engineer
+description: Writing .fdtd.json inputs, launching simulations via pyWrapper FDTD class, analyzing probe outputs (.dat files) in time/frequency domains, generating excitation signals with NumPy, and common EMC analysis patterns like shielding effectiveness and radar cross section
+---
+
+## When to use
+
+- Writing or modifying `.fdtd.json` input files for new simulation cases
+- Launching simulations and monitoring progress
+- Analyzing probe output data (`.dat` files) in time and frequency domains
+- Creating pre/post-processing Python scripts with the pyWrapper interface
+- Comparing FDTD results with analytical models or measurements
+- Setting up excitation signals (`.exc` files) for time-domain sources
+
+## Key Files to Read
+
+### Core Solver
+- `src_main_pub/launcher.F90` — Entry point (15 lines)
+- `src_main_pub/semba_fdtd.F90` — Main type: init/launch/end lifecycle
+- `src_main_pub/preprocess_geom.F90` — Geometry parsing, mesh building, material assignment
+- `src_main_pub/postprocess.F90` — DTFT, transfer functions, resampling
+- `src_main_pub/observation.F90` — Probe data collection and storage
+
+### Python Interface
+- `src_pyWrapper/pyWrapper.py` — FDTD, Probe, ExcitationFile classes
+
+### Documentation
+- `doc/fdtdjson.md` — Complete `.fdtd.json` input format specification (876 lines)
+- `doc/tutorials/veritasium/veritasium.md` — End-to-end workflow tutorial
+- `doc/mtln.md` — MTLN solver documentation
+
+### Example Cases
+- `testData/input_examples/` — 23 `.fdtd.json` files demonstrating various features
+- `testData/cases/` — 44 full simulation case directories
+- `testData/excitations/` — Source waveform definitions
+- `testData/netlists/` — SPICE netlist files for MTLN
+- `testData/cases/planewave/pw_prepost.py` — Simple pre/post example
+- `testData/cases/sgbcShieldingEffectiveness/sgbc_prepost.py` — EMC analysis with analytical comparison
+- `testData/cases/holland/holland_prepost.py` — Wire case comparison
+
+## Simulation Workflow
+
+### Step 1: Write the Input File
+
+The primary input format is `.fdtd.json` with these top-level keys:
+
+| Key | Required | Description |
+|-----|----------|-------------|
+| `general` | Yes | `timeStep`, `numberOfSteps`, optional `mtlnProblem`, `additionalArguments` |
+| `boundary` | No (defaults to `mur` on all sides) | Per-face: `pec`, `pmc`, `periodic`, `mur`, `pml` |
+| `mesh` | Yes | Contains `grid`, `coordinates`, `elements` |
+| `materials` | No | Material definitions |
+| `materialAssociations` | No | Links materials to mesh elements |
+| `sources` | No | Electromagnetic sources |
+| `probes` | No | Output sensors |
+
+### Mesh Definition
+
+```json
+"mesh": {
+  "grid": {
+    "numberOfCells": [nx, ny, nz],
+    "steps": {"x": [dx], "y": [dy], "z": [dz]},
+    "origin": [0, 0, 0]
+  },
+  "coordinates": {
+    "name": {"relativePosition": [cellIndex, fractionalX, fractionalY, fractionalZ]}
+  },
+  "elements": [
+    {"type": "node", "position": [x, y, z], "id": materialID},
+    {"type": "polyline", "points": [[x1,y1,z1], ...], "id": materialID},
+    {"type": "cell", "intervals": [[[ax,ay,az],[bx,by,bz]], ...], "id": materialID}
+  ]
+}
+```
+
+**Interval convention for `cell` elements:** An interval `[[ax,ay,az],[bx,by,bz]]` defines a region `[ax,bx) x [ay,by) x [az,bz)`. Depending on how many dimensions differ:
+- 0 dimensions differ = point
+- 1 dimension differs = oriented line
+- 2 dimensions differ = oriented surface
+- 3 dimensions differ = volume
+
+### Step 2: Generate Excitation Signals (if needed)
+
+For time-domain sources, generate `.exc` files using NumPy:
+
+```python
+import numpy as np
+
+# Gaussian pulse
+dt = 1e-12
+t0 = 20 * dt
+w0 = 5 * dt
+t = np.arange(0, t0 + 20 * w0, dt)
+f = np.exp(-(t - t0)**2 / w0**2)
+np.savetxt('gauss.exc', np.column_stack((t, f)))
+
+# Sigmoid step
+t = np.linspace(0, 10e-8, 2000)
+V = A * (sigmoid_raw - offset) * scaling
+np.savetxt('./step.exc', np.column_stack((t, V)))
+```
+
+### Step 3: Launch the Simulation
+
+**Direct command line:**
+```bash
+semba-fdtd -i CASE_NAME.fdtd.json
+mpirun -n NPROCS semba-fdtd -i CASE_NAME.fdtd.json
+```
+
+**Via Python pyWrapper:**
+```python
+from src_pyWrapper.pyWrapper import FDTD
+
+solver = FDTD(
+    input_filename='my_case.fdtd.json',
+    path_to_exe='/path/to/semba-fdtd',
+    flags=['-mapvtk'],          # Optional: generate VTK geometry map
+    mpi_command='mpirun -n 4'   # Optional: MPI launch
+)
+solver.cleanUp()    # Remove old output files
+solver.run()        # Execute the solver
+print(f"Return code: {solver.returncode}")
+```
+
+**Key command-line flags:**
+| Flag | Purpose |
+|------|---------|
+| `-mapvtk` | Generate VTK geometry map files for ParaView |
+| `-sgbc` | Enable SGBC (multi-layer surface) solver |
+| `-r` | Resume from a previous checkpoint |
+| `-stoch` / `-nostoch` | Stochastic simulation toggle |
+| `-cfl <value>` | Override CFL factor |
+
+### Step 4: Monitor Progress
+
+Check `SEMBA_FDTD_temp.log` for progress. The solver writes timing info periodically.
+
+### Step 5: Analyze Results
+
+**Output file types:**
+
+| Extension | Description | Example |
+|-----------|-------------|---------|
+| `.dat` | Time-domain probe data (`time field_value`) | `case_name.fdtd_probeName_Ex_5_10_3.dat` |
+| `_df.dat` | Frequency-domain (`freq magnitude phase`) | `case_name.fdtd_probeName_Ex_5_10_3_df.dat` |
+| `_tr.dat` | Transfer function in dB (`freq 20*log10(mag) phase`) | `case_name.fdtd_probeName_Ex_5_10_3_tr.dat` |
+| `.xdmf` + `.h5` | 3D field movies (HDF5 binary + XDMF metadata) | `case_name.xdmf` |
+| `.vtk` | Geometry map for ParaView | `case_name_1.vtk` |
+| `.old` | Restart field checkpoint | `case_name.fields.old` |
+
+**Load and analyze with pyWrapper:**
+```python
+from src_pyWrapper.pyWrapper import Probe, FDTD
+
+solver = FDTD(input_filename='pw-in-box.fdtd.json', path_to_exe=SEMBA_EXE)
+solver.run()
+
+# Load time-domain probe
+before = Probe(solver.getSolvedProbeFilenames("before")[0])
+print(before.dataframe)        # pandas DataFrame
+print(before.name)             # Extracted probe name
+print(before.field_column)     # Field column name
+print(before.cell_positions)   # Cell positions
+
+# Plot time domain
+import matplotlib.pyplot as plt
+plt.plot(before['time'], before['field'], label='Ex component')
+plt.xlabel('Time (s)')
+plt.ylabel('Field (V/m)')
+plt.legend()
+plt.show()
+
+# Load frequency-domain probe
+freq_probe = Probe(solver.getSolvedProbeFilenames("before")[0].replace('.dat', '_df.dat'))
+print(freq_probe.dataframe)    # Columns: frequency, magnitude, phase
+```
+
+**Manual analysis:**
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+# Time-domain data
+data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3.dat', skiprows=1)
+time = data[:, 0]
+field = data[:, 1]
+plt.plot(time, field)
+
+# Frequency-domain data
+freq_data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3_df.dat', skiprows=1)
+freq = freq_data[:, 0]
+magnitude = freq_data[:, 1]
+phase = freq_data[:, 2]
+
+# Transfer function
+tr_data = np.loadtxt('case_name.fdtd_probe_Ex_5_10_3_tr.dat', skiprows=1)
+plt.plot(tr_data[:, 0], tr_data[:, 1])  # dB vs frequency
+
+# FFT-based postprocessing (e.g., shielding effectiveness)
+INC = np.fft.fft(back['incident'])
+S transmitted = np.fft.fft(back['field']) / INC
+fq = np.fft.fftfreq(len(t)) / dt
+```
+
+## Common Analysis Patterns
+
+### Shielding Effectiveness (SGBC)
+```python
+# After running simulation with front/back probes
+front = Probe(front_files[0])
+back = Probe(back_files[0])
+
+INC = np.fft.fft(front['field'])
+S transmitted = np.fft.fft(back['field']) / INC
+fq = np.fft.fftfreq(len(t)) / dt
+
+# Compare with analytical transmission line model
+from skrf.media import Freespace
+fq = np.fft.fftfreq(len(t)) / dt
+f = fq[idx_min:idx_max]
+plt.plot(f, 20*np.log10(np.abs(fdtd_s21)), label='FDTD')
+plt.plot(f, 20*np.log10(np.abs(slab.s[:,0,1])), label='Analytical')
+plt.ylabel('Shielding Effectiveness (dB)')
+```
+
+### RCS (Radar Cross Section) via Far-Field Probes
+```python
+# Far-field probes produce _df.dat with angular data
+farfield = Probe(farfield_files[0])
+# Columns: frequency, magnitude, phase (integrated over observation sphere)
+```
+
+### MTLN Circuit Analysis
+```python
+# MTLN probes include voltage and current at each conductor
+mtln_probe = Probe(mtln_probe_files[0])
+# Check probe.domain_type for "time" or "frequency"
+# Columns include voltage/current per conductor
+```
+
+### Comparing MTLN vs Non-MTLN Builds
+```python
+# Run with both builds to compare transmission line effects
+solver_mtln = FDTD(fn, SEMBA_MTLN_EXE, run_in_folder=mtln_folder)
+solver_mtln.run()
+solver_nomtln = FDTD(fn, SEMBA_NOMTLN_EXE, run_in_folder=nomtln_folder)
+solver_nomtln.run()
+
+for pf in probe_files_mtln:
+    p = Probe(pf)
+    plt.plot(p['time'], p['current_0'], label=f'MTLN - {p.name}')
+for pf in probe_files_nomtln:
+    p = Probe(pf)
+    plt.plot(p['time'], p['current'], linestyle='--', label=f'No MTLN - {p.name}')
+plt.legend()
+```
+
+## Material Types Reference
+
+| Material | Use Case |
+|----------|----------|
+| `pec` | Perfect electric conductor |
+| `pmc` | Perfect magnetic conductor |
+| `isotropic` | Dielectric materials (epsilon_r, mu_r, sigma) |
+| `wire` | Wire grid materials |
+| `shieldedMultiwire` | Multi-conductor cables with PUL parameters |
+| `unshieldedMultiwire` | Bundles of parallel wires |
+| `terminal` | Port/termination definitions |
+| `lumped` | R/L/C lumped elements |
+| `multilayeredSurface` | SGBC multi-layer surfaces |
+| `thinSlot` | Thin slot models |
+| `connector` | SPICE connector definitions |
+
+## Common Gotchas
+
+1. **CFL condition** — The solver automatically adjusts `dt` if the user-specified value would cause instability. Check the log for the adjusted value.
+2. **Probe file naming** — Encodes probe name, field component, and cell position: `case_name.fdtd_probeName_Ex_5_10_3.dat`
+3. **Domain types** — Probes can output `time`, `frequency`, or `timeFrequency` data. The `_df.dat` and `_tr.dat` files are auto-generated during postprocessing.
+4. **Transfer functions** — Specify `magnitudeFile` in a probe's domain to compute transfer function normalization automatically.
+5. **MPI output** — When running with MPI, each rank writes its own probe files. Use `solver.getSolvedProbeFilenames()` to find all matching files.
+6. **VTK maps** — Use `-mapvtk` flag to generate geometry maps. Open with ParaView and use the `_tag_paraviewfilters.txt` guide for filtering.
+7. **Resuming simulations** — Use `-r` flag and `.old` checkpoint files. Ensure the input file hasn't changed since the checkpoint was created.
+8. **Mesh origin** — Default is `[0,0,0]`. If your geometry is offset, specify `grid.origin` in the mesh definition.
diff --git a/.agents/skills/hpc-engineer/SKILL.md b/.agents/skills/hpc-engineer/SKILL.md
new file mode 100644
index 000000000..ba3e92a33
--- /dev/null
+++ b/.agents/skills/hpc-engineer/SKILL.md
@@ -0,0 +1,301 @@
+---
+name: hpc-engineer
+description: 1D Z-axis MPI domain decomposition, OpenMP collapse(2) on YEE kernels, compiler optimization flags (GCC/Intel), profiling with gprof and NVTX placeholders, performance bottleneck analysis, and GPU roadmap with OpenACC placeholders
+---
+
+## When to use
+
+- Optimizing simulation performance for large 3D domains
+- Configuring MPI domain decomposition for cluster runs
+- Tuning OpenMP parallelization for multi-core nodes
+- Selecting compiler optimization flags for target hardware
+- Profiling and identifying performance bottlenecks
+- Evaluating GPU acceleration strategies (OpenACC roadmap)
+- Diagnosing MPI communication overhead or load imbalance
+
+## Key Files to Read
+
+### MPI Parallelization
+- `src_main_pub/mpicomm.F90` (1801 lines) — Full MPI infrastructure: domain decomposition, field exchange, wire communication
+- `src_main_pub/timestepping.F90` — Time-stepping loop with MPI flush calls
+
+### OpenMP Kernels
+- `src_main_pub/timestepping.F90` — Core YEE update kernels with `!$OMP PARALLEL DO COLLAPSE(2)`
+- `src_wires_pub/wires.F90` — Thin-wire model parallel loops
+- `src_main_pub/planewaves.F90` — Plane wave source update loops (100+ OpenMP references)
+
+### Build Configuration
+- `CMakeLists.txt` — Compiler flags, optional features, optimization options
+- `.github/workflows/ubuntu.yml` — CI build matrix with compiler variants
+- `set_precompiled_libraries.cmake` — Prebuilt library configuration
+
+### Profiling
+- `testData/cases/paul/gprof_output.txt` — Example gprof profiling output
+- `testData/cases/holland/gprof_paul.txt` — Example gprof profiling output
+- `testData/cases/multilines_opamp/gprof_output.txt` — Example gprof profiling output
+
+## MPI Parallelization
+
+### Domain Decomposition Strategy
+
+The project uses **1D domain decomposition along the Z-axis** (z-slicing). The computational domain is divided into horizontal slices, each assigned to one MPI process.
+
+**Key functions in `src_main_pub/mpicomm.F90`:**
+
+| Function | Line | Purpose |
+|----------|------|---------|
+| `InitGeneralMPI()` | 71-81 | Initialize MPI, get communicator size and rank |
+| `MPIdivide()` | 86-304 | Divide domain into Z-slices, assign sub-domains with PML handling |
+| `InitMPI()` / `InitMPI_Cray()` | 310, 1238-1423 | Initialize communication buffers for field exchange |
+| `FlushMPI_H()` / `FlushMPI_E()` | 434-587 | Exchange H/E field ghost cells (non-blocking I/O) |
+| `FlushMPI_H_Cray()` / `FlushMPI_E_Cray()` | 1425-1593 | Cray-optimized flush variants |
+| `MPIupdateMin()` | 361-367 | MPI_AllReduce with MPI_MIN for time-step synchronization |
+| `MPIupdateBloques()` | 372-380 | MPI_AllReduce with MPI_SUM for field observation |
+| `newInitWiresMPI()` / `newFlushWiresMPI()` | 600, 869 | Wire current/charge node data exchange |
+
+### MPI Communication Pattern
+
+Each time step performs **synchronous boundary exchange** between adjacent slices:
+
+    ! In solver_run / step (timestepping.F90):
+    call advanceE_fields(...)
+    call FlushMPI_E_Cray(...)    ! Exchange E-field ghost cells
+
+    call advanceH_fields(...)
+    call FlushMPI_H_Cray(...)    ! Exchange H-field ghost cells
+
+**Communication details:**
+- Uses **non-blocking I/O** (`MPI_ISEND`/`MPI_IRECV`) with `MPI_WAITALL` for overlap
+- Exchanges ghost cells between **adjacent ranks only** (neighbors in Z direction)
+- Wire models require additional MPI communication for current/charge node data at slice boundaries
+- Anisotropic materials, SGBC, and multiport BCs need extra flush buffers (`InitExtraFlushMPI`)
+
+### MPI Command-Line Usage
+
+```bash
+# Basic MPI run
+mpirun -n 4 semba-fdtd -i my_case.fdtd.json
+
+# With hostfile for cluster
+mpirun -n 64 -hostfile hosts semba-fdtd -i large_case.fdtd.json
+
+# Via Python pyWrapper
+solver = FDTD(
+    input_filename='my_case.fdtd.json',
+    path_to_exe='/path/to/semba-fdtd',
+    mpi_command='mpirun -n 8'
+)
+solver.run()
+```
+
+## OpenMP Parallelization
+
+### Always Enabled
+
+OpenMP is **unconditionally compiled** in all builds. The `CompileWithOpenMP` define is always set.
+
+### Compiler Flags
+
+| Compiler | OpenMP Flag |
+|----------|-------------|
+| GNU (gfortran) | `-fopenmp` |
+| IntelLLVM (ifx) | `-qopenmp` |
+
+### Core YEE Update Kernels
+
+All six field update kernels in `timestepping.F90` use `COLLAPSE(2)` on the inner j,k loops:
+
+| Kernel | Line | OpenMP Directive |
+|--------|------|-----------------|
+| `advanceEx` | 2206-2225 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+| `advanceEy` | 2249-2266 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+| `advanceEz` | 2295-2312 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+| `advanceHx` | 2358-2376 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+| `advanceHy` | 2400-2417 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+| `advanceHz` | 2439-2455 | `!$OMP PARALLEL DO DEFAULT(SHARED) COLLAPSE(2)` |
+
+**Parallelization strategy:** `COLLAPSE(2)` on the j,k loops leaves the k-loop as the outer OpenMP dimension. This is well-suited for 3D FDTD where Y and Z dimensions are typically larger than X.
+
+### Other Parallelized Loops
+
+- **Wire charge node updates** (`wires.F90`): `!$OMP PARALLEL DO` on wire charge nodes
+- **Plane wave sources** (`planewaves.F90`): Many loops with `!$OMP PARALLEL DO`
+- **Observation FFT** (`observation.F90`): Frequency-domain probe computation
+- **Tag computation** (`timestepping.F90` `fillMtag()`): Three separate `!$OMP PARALLEL DO` regions
+
+### OpenMP Configuration
+
+```bash
+# Set number of threads per MPI rank
+export OMP_NUM_THREADS=4
+mpirun -n 8 ./build/bin/semba-fdtd -i case.fdtd.json
+# Total: 8 MPI ranks x 4 OpenMP threads = 32 cores
+
+# Hybrid parallelism (recommended for multi-node clusters)
+# Node 1: 2 MPI ranks x 12 OpenMP threads = 24 threads
+# Node 2: 2 MPI ranks x 12 OpenMP threads = 24 threads
+export OMP_NUM_THREADS=12
+mpirun -n 4 ./build/bin/semba-fdtd -i case.fdtd.json
+```
+
+## Compiler Optimization Flags
+
+### GNU Compiler (gcc/gfortran)
+
+| Mode | Flags |
+|------|-------|
+| Release | `-Ofast` |
+| Debug | `-g -O0 -fno-inline -fcheck=all -fbacktrace` |
+| Common | `-fopenmp -ffree-form -ffree-line-length-none -fdec -fallow-argument-mismatch` |
+
+### IntelLLVM Compiler (ifx)
+
+| Mode | Flags |
+|------|-------|
+| Release | `-O3 -fp-model fast=2w` |
+| Debug | `-check all,nouninit -debug full -traceback` |
+| Common | `-qopenmp -fpp -static-intel` |
+
+### Optional Intel Optimizations
+
+| CMake Option | Flag | Purpose |
+|--------------|------|---------|
+| `SEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION` | `-xHost` | CPU-specific auto-vectorization |
+| `SEMBA_FDTD_ENABLE_INTEL_IPO` | `-ipo` | Interprocedural optimization |
+
+### Build Commands for Performance
+
+```bash
+# Standard release build
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j
+
+# With Intel CPU-specific optimization
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
+  -DSEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION=ON \
+  -DSEMBA_FDTD_ENABLE_INTEL_IPO=ON
+cmake --build build -j
+
+# Double precision (may be slower but more accurate)
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release \
+  -DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON
+cmake --build build -j
+```
+
+## Profiling Infrastructure
+
+### gprof (Historical)
+
+The project has been profiled with gprof in the past. Reference outputs exist in:
+- `testData/cases/paul/gprof_paul.txt`
+- `testData/cases/holland/gprof_output.txt`
+- `testData/cases/multilines_opamp/gprof_output.txt`
+
+To profile with gprof, rebuild with `-pg` flag and run the simulation:
+
+```bash
+# Add -pg to Fortran flags in CMakeLists.txt temporarily
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j
+
+# Run simulation
+./build/bin/semba-fdtd -i case.fdtd.json
+
+# Analyze
+gprof ./build/bin/semba-fdtd gmon.out > profile.txt
+```
+
+### NVTX (NVIDIA Tools Extension — Placeholder)
+
+The code contains NVTX range markers under `#ifdef CompileWithProfiling` in `timestepping.F90` (15 locations):
+
+    #ifdef CompileWithProfiling
+    call nvtxStartRange("Antes del bucle N")       ! Entire time loop
+    call nvtxStartRange("Antes del bucle EX/EY/EZ") ! E-field components
+    call nvtxStartRange("Antes del bucle HX/HY/HZ") ! H-field components
+    call nvtxEndRange
+    #endif
+
+However, `CompileWithProfiling` is never defined in the build system, and the `nvtx` module does not exist in the repository. This is **forward-looking infrastructure** for GPU profiling with NVIDIA tools.
+
+## Performance Bottlenecks
+
+### Primary Bottlenecks
+
+1. **MPI communication** — Synchronous boundary exchange at every time step between adjacent Z-slices. This is the dominant bottleneck for large MPI runs.
+2. **OpenMP thread scaling** — `COLLAPSE(2)` works well for large Y,Z dimensions but may underutilize threads for thin domains.
+3. **Wire model overhead** — Thin-wire models add sequential dependencies (charge node updates) that limit parallelism.
+4. **Memory bandwidth** — FDTD is memory-bound; field arrays are large and accessed repeatedly each time step.
+
+### Optimization Strategies
+
+| Strategy | Expected Impact | Effort |
+|----------|----------------|--------|
+| Increase MPI ranks (more Z-slices) | High for large domains | Low |
+| Tune `OMP_NUM_THREADS` per rank | Medium | Low |
+| Enable `-xHost` (Intel) | Medium (vectorization) | Low |
+| Enable IPO (`-ipo`) | Medium (inlining) | Low |
+| Use double precision only when needed | High (memory bandwidth) | Low |
+| 2D/3D MPI decomposition | High (less communication) | High |
+| GPU acceleration (OpenACC) | High | Very high |
+| Loop unrolling / SIMD hints | Low-Medium | Medium |
+
+## GPU Roadmap
+
+### Current State
+
+**No active GPU acceleration.** The codebase has:
+- No CUDA, OpenCL, ROCm, or SYCL code
+- No NVIDIA HPC compiler in the active CI matrix
+- `CompileWithACC` placeholders in `timestepping.F90` (6 locations) suggesting **OpenACC planning**
+
+### OpenACC Placeholders (Inactive)
+
+In `timestepping.F90`, the YEE kernels have commented-out OpenACC directives:
+
+    #ifdef CompileWithACC
+    !$ACC parallel loop DEFAULT(present) collapse(2) \
+        private(...) copyin(...) copyout(...)
+    #endif
+
+These exist for all six field update kernels (lines 2209, 2252, 2298, 2361, 2403, 2442).
+
+### Recommended GPU Strategy
+
+1. **Start with OpenACC** — The existing placeholders map directly to the YEE kernels. Enable with NVHPC or NVIDIA HPC SDK compiler.
+2. **Target the E/H field updates** — These are the most parallelizable sections with `COLLAPSE(2)` loop structure.
+3. **Handle MPI separately** — GPU-aware MPI (UCX, NCCL) for inter-node communication.
+4. **Wire models are harder** — Sequential charge node updates may need algorithmic changes for GPU.
+
+### NVHPC Compiler Support (Stub)
+
+NVHPC is partially supported in `CMakeLists.txt` but marked as `TODO: Tune flags for NVHPC`. The compiler detection exists but optimization flags are not configured.
+
+## Performance Tuning Checklist
+
+### Before Running Large Simulations
+
+- [ ] Verify CFL condition is satisfied (check log for adjusted dt)
+- [ ] Set `OMP_NUM_THREADS` appropriate for node core count
+- [ ] Use MPI rank count matching available nodes/cores
+- [ ] Enable compiler optimizations (`-Ofast` for GCC, `-O3` for Intel)
+- [ ] Consider `-xHost` for Intel CPUs (if geometry is large enough)
+- [ ] Use single precision unless accuracy requires double precision
+- [ ] Disable unused features (MTLN, HDF, SMBJSON) to reduce memory
+
+### For MPI Scaling Tests
+
+- [ ] Run strong scaling test: fixed problem size, increasing MPI ranks
+- [ ] Run weak scaling test: fixed work per rank, increasing total size
+- [ ] Monitor MPI communication time in logs
+- [ ] Check for load imbalance (some ranks finishing much earlier)
+- [ ] Verify Z-slice decomposition matches aspect ratio of domain
+
+### Debugging Performance Issues
+
+- [ ] Check `SEMBA_FDTD_temp.log` for timing information per time step
+- [ ] Profile with gprof to identify hot functions
+- [ ] Verify OpenMP thread count with `OMP_NUM_THREADS`
+- [ ] Check for unnecessary data copying or reallocation in hot loops
+- [ ] Ensure contiguous array allocation (check `contiguous` attribute on field pointers)
+- [ ] Review pointer aliasing patterns for potential false dependencies
diff --git a/.agents/skills/unit-tests/SKILL.md b/.agents/skills/unit-tests/SKILL.md
new file mode 100644
index 000000000..59e4e3035
--- /dev/null
+++ b/.agents/skills/unit-tests/SKILL.md
@@ -0,0 +1,224 @@
+---
+name: unit-tests
+description: GoogleTest patterns for semba-fdtd: Fortran/C++ glue with bind(C), testingTools assertion helpers (expect_eq_int, expect_near, checkNear), pytest markers (mtln/hdf/mpi/codemodel), test data management, step-by-step guide for adding new unit and integration tests
+---
+
+## When to use
+
+- Adding a new feature or bug fix that needs test coverage
+- Creating regression tests for a reported issue
+- Modifying existing code and wanting to verify behavior hasn't changed
+- Setting up a new test subdirectory under `test/`
+- Adding Python integration tests for the solver
+
+## Key Files to Read
+
+### GoogleTest (C++/Fortran)
+- `test/CMakeLists.txt` — Top-level test build configuration
+- `test/fdtd_tests.cpp` — GoogleTest main entry point
+- `test/smbjson/smbjson_testingTools.F90` — Assertion helpers for JSON parser tests
+- `test/mtln/mtln_testingTools.F90` — Assertion helpers for MTLN tests
+- `test/conformal/conformal_tests.h` — Example C++ glue header mapping Fortran to GoogleTest
+- `test/pyWrapper/utils.py` — Python test utilities, pytest markers, FDTD/Probe classes
+
+### Test Data
+- `testData/cases/` — Full simulation cases used by integration tests
+- `testData/input_examples/` — Minimal `.fdtd.json` files for unit test inputs
+- `testData/outputs/` — Reference output data for regression comparison
+
+### Python Integration Tests
+- `test/pyWrapper/test_integration.py` — End-to-end solver tests
+- `test/pyWrapper/test_full_system.py` — Full system tests with MPI, HDF5, MTLN
+- `test/pyWrapper/test_mtln_standalone.py` — MTLN-specific integration tests
+
+## GoogleTest Architecture (Tier 1)
+
+### Directory Structure
+
+Every test subdirectory under `test/` follows this pattern:
+
+    test/<subdir>/
+      CMakeLists.txt              # Builds a static test library
+      <subdir>_tests.cpp          # 1-line C++ file: #include "<subdir>_tests.h"
+      <subdir>_tests.h            # Maps Fortran functions to TEST() macros
+      <subdir>_testingTools.F90   # Assertion helpers (if needed)
+      test_*.F90                  # Fortran test functions
+
+### Fortran Test Function Pattern
+
+Each test is a `bind(C)` integer function returning `0` on success, non-zero on failure:
+
+    integer function test_my_feature() bind(C) result(err)
+       use some_module
+       use some_testingTools
+       implicit none
+       integer :: err
+       err = 0
+       ! ... test logic ...
+       if (.not. expected_condition) then
+          err = err + 1
+          call testFails(err, "Descriptive failure message")
+       end if
+    end function
+
+### C++ Glue Header Pattern
+
+The `.h` file declares `extern "C"` Fortran functions and maps them to GoogleTest:
+
+    extern "C" int test_my_feature();
+
+    TEST(subdir, my_feature) {
+        EXPECT_EQ(0, test_my_feature());
+    }
+
+### Assertion Helpers
+
+**Fortran testingTools modules** provide:
+
+| Function | Purpose |
+|----------|---------|
+| `expect_eq_int(err, expected, provided, msg)` | Integer comparison |
+| `expect_near(a, b, tol)` | Floating point comparison |
+| `testFails(err, msg)` | Increment error counter with message |
+| `comparePULMatrices(error_cnt, m_line, m_input)` | Matrix comparison |
+| `checkNear(target, number, rel_tol)` | Relative tolerance check (multiple overloads) |
+| `expect_eq(err, ex, pr, ignoreRegions)` | Full structural comparison of derived types |
+
+### Conditional Compilation
+
+Tests are included conditionally based on CMake flags. Check `test/CMakeLists.txt`:
+
+- `SEMBA_FDTD_ENABLE_MTLN` -> enables `mtln/` and `system/` tests
+- `SEMBA_FDTD_ENABLE_SMBJSON` -> enables `smbjson/`, `rotate/`, `vtk/`, `observation/`, `utils/` tests
+- `SEMBA_FDTD_ENABLE_MPI` -> excludes `observation/` and `system/` from GoogleTest
+
+**Always guard test code with the same `#ifdef` guards** as the source code it tests.
+
+## Python Integration Tests (Tier 2)
+
+### Directory Structure
+
+Python tests live in `test/pyWrapper/`:
+
+    test/pyWrapper/
+      utils.py                      # FDTD/Probe classes, pytest markers, helpers
+      test_pyWrapper.py             # pyWrapper class tests
+      test_integration.py           # End-to-end solver tests
+      test_full_system.py           # Full system tests (MPI, HDF5, MTLN)
+      test_mtln_standalone.py       # MTLN-specific tests
+
+### Pytest Markers
+
+Defined in `pytest.ini`:
+
+| Marker | Purpose |
+|--------|---------|
+| `mtln` | Tests using MTLN features |
+| `codemodel` | Tests needing xspice codemodels |
+| `hdf` | Tests needing HDF5 |
+| `mpi` | Tests to be run with mpirun |
+
+Apply via decorators:
+
+    @pytest.mark.mtln
+    def test_my_mtln_feature():
+        ...
+
+Skip logic (from `utils.py`):
+
+    mtln_skip = pytest.mark.skipif(
+        os.getenv("SEMBA_FDTD_ENABLE_MTLN") == "OFF",
+        reason="MTLN not enabled"
+    )
+
+### Running Tests
+
+    # All tests
+    pytest test/ --durations=20
+
+    # By marker
+    pytest test/ -m mtln
+    pytest test/ -m hdf
+    pytest test/ -m mpi
+    pytest test/ -m codemodel
+
+    # Exclude marker
+    pytest test/ -m "not codemodel"
+
+### Integration Test Pattern
+
+Integration tests run the actual solver and validate outputs:
+
+    from utils import FDTD, Probe, CASES_FOLDER, OUTPUTS_FOLDER
+
+    @pytest.mark.mtln
+    def test_my_case(tmp_path):
+        case_dir = CASES_FOLDER / "myCase"
+        output_dir = tmp_path / "output"
+        output_dir.mkdir()
+
+        solver = FDTD(
+            input_filename="myCase.fdtd.json",
+            path_to_exe=SEMBA_EXE,
+            run_in_folder=str(output_dir)
+        )
+        solver.cleanUp()
+        solver.run()
+        assert solver.returncode == 0
+
+        # Validate probe outputs
+        probe_files = solver.getSolvedProbeFilenames("myProbe")
+        assert len(probe_files) > 0
+        probe = Probe(probe_files[0])
+        # ... validate data against expected values ...
+
+## How to Add a New Test
+
+### For a new Fortran module:
+
+1. **Create test subdirectory:** `test/<my_module>/`
+2. **Write Fortran test functions** following the `bind(C)` integer-returning pattern
+3. **Create testingTools module** if custom assertions are needed
+4. **Create the `.h` header** mapping Fortran functions to `TEST(group, name)` macros
+5. **Create 1-line `.cpp` glue file** with `#include "<my_module>_tests.h"`
+6. **Add CMakeLists.txt** building a static library from the test files
+7. **Register in `test/CMakeLists.txt`** — add `add_subdirectory(<my_module>)` under the appropriate conditional
+8. **Include the header in `fdtd_tests.cpp`** with the matching `#ifdef` guard
+9. **Add test data** to `testData/input_examples/` or `testData/cases/`
+10. **Run `./build/bin/fdtd_tests`** to verify
+
+### For a new Python integration test:
+
+1. **Add test function** to the appropriate file in `test/pyWrapper/`
+2. **Apply correct pytest marker** (`mtln`, `hdf`, `mpi`, `codemodel`)
+3. **Use `tmp_path` fixture** for isolated temp directories
+4. **Use `FDTD` class** from `utils.py` to launch simulations
+5. **Use `Probe` class** to parse and validate `.dat` output files
+6. **Compare against reference data** using `np.allclose`, correlation coefficients, or file content checks
+7. **Run `pytest test/ -m <marker>`** to verify
+
+## Test Data Management
+
+### When adding test cases:
+
+- Place `.fdtd.json` input files in `testData/input_examples/` for small, focused tests
+- Place full simulation cases in `testData/cases/<caseName>/` for integration tests
+- Place reference outputs in `testData/outputs/` for regression comparison
+- Keep test cases minimal — remove unnecessary cells, steps, or probes while still exercising the code path
+- Name files descriptively: `myFeature_<detail>.fdtd.json`
+
+### When modifying test data:
+
+- If you change expected outputs, update `testData/outputs/` accordingly
+- If you change a `.fdtd.json` input, verify it against `doc/fdtdjson.md` for correctness
+- Run the full test suite after modifying shared test data: `pytest test/ --durations=20`
+
+## Common Gotchas
+
+1. **Fortran test functions must use `bind(C)`** — without this, C++ cannot call them
+2. **Return `0` for pass, non-zero for fail** — GoogleTest expects `EXPECT_EQ(0, ...)` in the C++ glue
+3. **Conditional compilation must match** — if a feature is `#ifdef CompileWithMTLN` in source, tests must use the same guard
+4. **MPI-free builds** — `observation/` and `system/` GoogleTest subdirectories are only compiled when `SEMBA_FDTD_ENABLE_MPI=OFF`
+5. **Test data paths** — use `PATH_TO_TEST_DATA` parameter defined in testingTools modules, not hardcoded paths
+6. **Python test isolation** — always use `tmp_path` fixture; never write to the repo directory
+7. **Tolerance differences** — double precision builds (`CompileWithReal8`) may need looser tolerances than single precision
diff --git a/.github/workflows/ubuntu.yml b/.github/workflows/ubuntu.yml
index 22c8b8df0..e0da0a7c4 100644
--- a/.github/workflows/ubuntu.yml
+++ b/.github/workflows/ubuntu.yml
@@ -22,7 +22,8 @@ jobs:
         os: [ubuntu-latest]
         compiler: [
           {name: 'intel', version: '2025.1'},
-          {name: 'gcc', version: 11}
+          {name: 'gcc', version: 11},
+          {name: 'nvidia-hpc', version: '24.5'}
         ]
         build-type: ["Release"]
         mpi: ["ON", "OFF"]
@@ -31,15 +32,6 @@ jobs:
         double-precision: ["OFF"]
 
         include:
-          # Disable by lack of space on github action
-          # - os: ubuntu-latest
-          #   compiler: {name: 'nvidia-hpc', version: '24.5'}
-          #   build-type: "Release"
-          #   mpi: "OFF"
-          #   mtln: "OFF"
-          #   hdf: "OFF"
-          #   double-precision: "OFF"
-
           - os: ubuntu-latest # This is the only test with double precision.
             compiler: {name: 'intel', version: '2025.1'}
             build-type: "Release"
@@ -47,11 +39,45 @@ jobs:
             mtln: "OFF"
             hdf: "ON"
             double-precision: "ON"
+
+          - os: ubuntu-latest # Single NVIDIA HPC build: no MPI/MTLN/HDF to keep it slim.
+            compiler: {name: 'nvidia-hpc', version: '24.7'}
+            build-type: "Release"
+            mpi: "OFF"
+            mtln: "OFF"
+            hdf: "OFF"
+            double-precision: "OFF"
+
+          - os: ubuntu-latest # NVIDIA HPC with CUDA Fortran GPU acceleration.
+            compiler: {name: 'nvidia-hpc', version: '24.7'}
+            build-type: "Release"
+            mpi: "OFF"
+            mtln: "OFF"
+            hdf: "OFF"
+            double-precision: "OFF"
+            cuda-fortran: "ON"
+
+        exclude:
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            mpi: "ON"
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            mtln: "ON"
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            hdf: "ON"
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            cuda-fortran: "ON"
+            mpi: "ON"
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            cuda-fortran: "ON"
+            mtln: "ON"
+          - compiler: {name: 'nvidia-hpc', version: '24.7'}
+            cuda-fortran: "ON"
+            hdf: "ON"
             
       fail-fast: false
             
     runs-on: ${{ matrix.os }}
-    name: ${{matrix.os}} / ${{matrix.compiler.name}} / ${{matrix.build-type}}-mpi(${{matrix.mpi}})-mtln(${{matrix.mtln}})-hdf(${{matrix.hdf}})-double(${{matrix.double-precision}})
+    name: ${{matrix.os}} / ${{matrix.compiler.name}} / ${{matrix.build-type}}-mpi(${{matrix.mpi}})-mtln(${{matrix.mtln}})-hdf(${{matrix.hdf}})-double(${{matrix.double-precision}})-cuf(${{matrix.cuda-fortran}})
 
     steps:       
     - name: Checkout
@@ -82,7 +108,24 @@ jobs:
         compiler: ${{matrix.compiler.name}}
         version: ${{matrix.compiler.version}}
         
+    - name: Build application (nvidia-hpc)
+      if: matrix.compiler.name=='nvidia-hpc'
+      run: |
+        CUF_FLAG=""
+        if [ "${{matrix.cuda-fortran}}" = "ON" ]; then
+          CUF_FLAG="-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON"
+        fi
+        cmake -S . -B build-nvhpc-rls \
+          -DCMAKE_BUILD_TYPE=${{matrix.build-type}} \
+          -DSEMBA_FDTD_ENABLE_MPI=OFF \
+          -DSEMBA_FDTD_ENABLE_HDF=OFF \
+          -DSEMBA_FDTD_ENABLE_MTLN=OFF \
+          -DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=OFF \
+          $CUF_FLAG
+        cmake --build build-nvhpc-rls -j
+
     - name: Build application
+      if: matrix.compiler.name!='nvidia-hpc'
       run: |       
         cmake -S . -B build \
           -DCMAKE_BUILD_TYPE=${{matrix.build-type}} \
@@ -93,14 +136,24 @@ jobs:
         cmake --build build -j
       
     - name: Run unit tests 
+      if: matrix.compiler.name!='nvidia-hpc'
       run: build/bin/fdtd_tests
 
     - name: Run python tests
+      if: matrix.compiler.name!='nvidia-hpc'
       env:
         SEMBA_FDTD_ENABLE_MPI: ${{ matrix.mpi }}
         SEMBA_FDTD_ENABLE_MTLN: ${{ matrix.mtln }}
         SEMBA_FDTD_ENABLE_HDF: ${{ matrix.hdf }}
       run: python -m pytest test/ --durations=20
+
+    - name: Run python tests (nvidia-hpc)
+      if: matrix.compiler.name=='nvidia-hpc'
+      env:
+        SEMBA_FDTD_ENABLE_MPI: "OFF"
+        SEMBA_FDTD_ENABLE_MTLN: "OFF"
+        SEMBA_FDTD_ENABLE_HDF: "OFF"
+      run: python -m pytest test/ --durations=20
         
        
       
diff --git a/.opencode/agents/tester.md b/.opencode/agents/tester.md
new file mode 100644
index 000000000..9992180bf
--- /dev/null
+++ b/.opencode/agents/tester.md
@@ -0,0 +1,44 @@
+---
+description: Runs the full semba-fdtd test suite (unit + integration tests), builds if needed, and produces a JSON report + terminal summary
+mode: subagent
+permission:
+  bash: allow
+  read: allow
+  write: allow
+  edit: deny
+  glob: allow
+  grep: allow
+  list: allow
+  task: deny
+  external_directory: allow
+---
+
+You are the semba-fdtd tester agent. Your job is to build and run the complete test suite, then report results.
+
+## Available tool
+
+A test runner script is available at `scripts/tester.py`. Use it to run the full test suite:
+
+```bash
+python3 scripts/tester.py
+```
+
+It accepts optional CMake flags as arguments:
+```bash
+python3 scripts/tester.py --MPI=OFF --MTLN=OFF
+```
+
+## Workflow
+
+1. **Run the test suite** using `python3 scripts/tester.py` (optionally with flags from the user's request).
+2. **Read the JSON report** written to `tmp/test_report_<timestamp>.json` and summarize the results for the user.
+3. If tests fail, **show the relevant failure details** from the terminal output.
+4. **Never modify** source code or test files — only run tests and report.
+
+## Rules
+
+- Always use `scripts/tester.py` as the entry point for running tests.
+- Set the working directory to the project root (`/home/luis/ugrfdtd/publico`).
+- If the user asks about specific tests, use `pytest test/ -m <marker>` or `./build/bin/fdtd_tests` directly.
+- Create the `tmp/` directory if it doesn't exist before writing reports.
+- Exit with non-zero if any test failed.
diff --git a/.opencode/plans/gpu-nf2ff.md b/.opencode/plans/gpu-nf2ff.md
new file mode 100644
index 000000000..f30e51a38
--- /dev/null
+++ b/.opencode/plans/gpu-nf2ff.md
@@ -0,0 +1,281 @@
+# GPU Acceleration Plan: Near-to-Far-Field (NF2FF)
+
+## Algorithm Overview
+
+NF2FF computes far-field radiation patterns from near-field E/H data sampled on a Huygens box surface. Two phases:
+
+### Phase 1: Time-Domain DFT Accumulation (`UpdateFarField`)
+- Called every `NDecim` timesteps during simulation
+- For each of 6 Huygens box faces × 2 field components:
+  - Read E/H field values at cell locations on the face
+  - Multiply by DFT phase factor `auxExp_E/H(ii)` for each frequency
+  - Accumulate into complex buffer arrays `buffer(j,k,ii)`
+- **Loop**: `faces × components × Nj × Nk × NumFreqs` complex multiply-adds
+- **Data**: Reads from YEE grid fields, writes to 12 complex buffer arrays (3D)
+- **Memory**: `12 × Nj × Nk × NumFreqs × 16 bytes` (double-precision complex)
+
+### Phase 2: Far-Field Pattern Computation (`FlushFarfield`)
+- Called once at end of simulation (or at flush intervals)
+- For each frequency × each (theta, phi) angle pair:
+  - For each of 3 face-pairs (Tr/Fr, Iz/De, Ab/Ar):
+    - For each face in pair (2 faces):
+      - For each cell on Huygens surface:
+        - Compute equivalent currents M (magnetic), J (electric)
+        - Compute 6 complex phase factors `exp(j·k·r·n̂)`
+        - Accumulate L_θ, L_φ, N_θ, N_φ (Huygens integral)
+        - Apply PEC/PMC symmetry clones (up to 12 per cell)
+  - MPI_AllReduce 4 complex values (across MPI ranks)
+  - Compute |E_θ|, |E_φ|, RCS
+  - Write output file
+- **Loop**: `NumFreqs × Ntheta × Nphi × 3 face-pairs × 2 faces × Ncells × (1 + Nclones)`
+- **Core kernel**: `update_LN` — 6 complex exps + 12 complex muls + 8 complex adds per cell
+
+## Current GPU Status
+
+**NONE.** The existing GPU infrastructure (`gpu_state_t` in `gpu_core_m.F90`) only manages YEE fields, CPML buffers, and MUR buffers. No NF2FF state on device.
+
+## GPU Acceleration Strategy
+
+### Priority 1: `FlushFarfield` — Far-Field Pattern Kernel (HIGH IMPACT)
+
+This is the compute-heavy phase with massive parallelism across (freq, theta, phi).
+
+**Kernel Design:**
+```
+farfield_pattern_kernel<<<grid, block>>>(
+    // Device pointers to DFT buffers (read-only)
+    ExIz_d, ExDe_d, ExAb_d, ExAr_d, EyFr_d, EyTr_d, EyAb_d, EyAr_d,
+    EzIz_d, EzDe_d, EzFr_d, EzTr_d,
+    HxIz_d, HxDe_d, HxAb_d, HxAr_d, HyFr_d, HyTr_d, HyAb_d, HyAr_d,
+    HzIz_d, HzDe_d, HzFr_d, HzTr_d,
+    HxIz2_d, ..., HzTr2_d,  // Schneider averaging buffers
+
+    // Device pointers to geometry (read-only)
+    phys_x_Mx_d, phys_y_Mx_d, phys_z_Mx_d, ...  // 18 coordinate arrays
+    dyh_d, dze_d, dye_d, dzh_d,  // cell dimensions
+
+    // Output: far-field results per (freq, theta, phi)
+    Etheta_d, Ephi_d, RCS_d,
+
+    // Configuration
+    NumFreqs, Ntheta, Nphi,
+    thetaStart, thetaStep, phiStart, phiStep,
+    freq, comun, cluz, z0,
+    // Symmetry flags (bitmask or boolean arrays)
+    sym_flags_d
+)
+```
+
+**Launch Configuration:**
+- **Grid**: `(NumFreqs, Ntheta, Nphi)` — each block handles one (freq, theta, phi)
+- **Block**: `ceil(Ncells / 2)` threads — each thread handles 1-2 Huygens cells
+- **Shared memory**: Precomputed direction cosines, cell positions for current face
+
+**Data Dependencies:**
+- Each (freq, theta, phi) is **independent** — no synchronization needed
+- MPI reduction: GPU computes per-rank sums, CPU performs `MPI_AllReduce` on 4 complex values per angle (tiny: 128 bytes per angle)
+- Two passes (aritmetica/geometrica): sequential within same (freq, theta, phi), handled by kernel
+
+**Symmetry Clones Handling:**
+- Current code: `if (flag) call cloneTrFr(...)` — branch-heavy
+- GPU strategy: Use **predicated execution** — each thread evaluates all 12 symmetry flags with `if` guards
+- Alternative: Precompute symmetry transformation table on host, launch separate kernel passes per symmetry group
+
+**Memory Layout:**
+- DFT buffers: Keep as-is `buffer(j, k, ii)` — 2D face access is coalesced along j dimension
+- Geometry arrays: Coalesce along j,k dimensions on each face
+- Output: `Etheta(freq, theta, phi)` — coalesced along phi dimension
+
+**Estimated Speedup:** 50-200x for typical cases (1000 freq × 100×100 angles, 100×100 Huygens box)
+
+### Priority 2: `UpdateFarField` — DFT Accumulation (MODERATE IMPACT)
+
+This runs every timestep and is more memory-bound. The strided write pattern is the challenge.
+
+**Kernel Design:**
+```
+dft_accumulate_kernel<<<grid, block>>>(
+    // Device pointers to YEE fields (read-only)
+    Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d,
+
+    // Device pointers to DFT buffers (read-write, atomic-friendly)
+    ExIz_d, ExDe_d, ..., HzTr2_d,  // 18 arrays
+
+    // DFT phase factors (read-only, broadcast)
+    auxExp_E_d, auxExp_H_d,
+
+    // Face geometry and indices
+    face_indices_d,  // (field, j_start, j_end, k_start, k_end) per face/component
+    dyh_d, dye_d, dze_d, dzh_d,
+
+    NumFreqs
+)
+```
+
+**Launch Configuration:**
+- **Grid**: `(Nj, Nk)` — each block handles one face cell
+- **Block**: `NumFreqs` threads (or use loop if NumFreqs > block size)
+- **Alternative**: Grid `(Nj, Nk, NumFreqs)` — each thread handles one (j, k, ii) triple
+
+**Memory Coalescing Challenge:**
+- Reads from YEE fields: coalesced along j dimension ✅
+- Writes to `buffer(j, k, ii)` with ii innermost: **strided, not coalesced** ❌
+- **Mitigation**: Transpose buffer layout to `buffer(ii, j, k)` — makes writes coalesced when threads differ in j/k
+
+**Estimated Speedup:** 5-20x (memory-bound, strided writes)
+
+## Implementation Steps
+
+### Step 1: Extend `gpu_state_t` with NF2FF buffers
+```fortran
+type gpu_state_t
+    ! ... existing fields ...
+    
+    ! NF2FF DFT buffers (device)
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: ExIz_d, ExDe_d, ExAb_d, ExAr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: EyFr_d, EyTr_d, EyAb_d, EyAr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: EzIz_d, EzDe_d, EzFr_d, EzTr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HxIz_d, HxDe_d, HxAb_d, HxAr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HyFr_d, HyTr_d, HyAb_d, HyAr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HzIz_d, HzDe_d, HzFr_d, HzTr_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HxIz2_d, HxDe2_d, HxAb2_d, HxAr2_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HyFr2_d, HyTr2_d, HyAb2_d, HyAr2_d
+    complex(kind=rkind), pointer, device, dimension(:,:,:) :: HzIz2_d, HzDe2_d, HzFr2_d, HzTr2_d
+    
+    ! NF2FF frequency arrays (device)
+    complex(kind=rkind), pointer, device, dimension(:) :: expIwdt_d, auxExp_E_d, auxExp_H_d
+    
+    ! NF2FF geometry (device)
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_Mx_d, phys_y_Mx_d, phys_z_Mx_d
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_My_d, phys_y_My_d, phys_z_My_d
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_Mz_d, phys_y_Mz_d, phys_z_Mz_d
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jx_d, phys_y_Jx_d, phys_z_Jx_d
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jy_d, phys_y_Jy_d, phys_z_Jy_d
+    real(kind=rkind), pointer, device, dimension(:) :: phys_x_Jz_d, phys_y_Jz_d, phys_z_Jz_d
+    
+    ! NF2FF configuration
+    logical :: nf2ff_initialized = .false.
+    integer(kind=4) :: nf2ff_num_cells
+    integer(kind=4) :: nf2ff_num_freqs
+end type gpu_state_t
+```
+
+### Step 2: Create `gpu_nf2ff_m.F90` with GPU kernels
+
+**Files to create:**
+- `src_main_pub/gpu_nf2ff_m.F90` — GPU NF2FF kernels module
+
+**Kernels to implement:**
+1. `gpu_init_nf2ff_buffers()` — allocate and copy DFT buffers + geometry to device
+2. `gpu_update_nf2ff()` — DFT accumulation kernel (Priority 2)
+3. `gpu_flush_nf2ff()` — far-field pattern kernel (Priority 1)
+4. `gpu_destroy_nf2ff_buffers()` — deallocate device buffers
+
+### Step 3: Wire into `timestepping.F90`
+
+**At initialization** (after `InitFarField`):
+```fortran
+call gpu_init_nf2ff_buffers(this%gpu, FF)
+```
+
+**During timestep** (in `UpdateFarField` path):
+```fortran
+if (this%gpu_initialized .and. this%gpu%nf2ff_initialized) then
+    call gpu_update_nf2ff(this%gpu, this%bounds)
+else
+    call UpdateFarField(...)  // CPU fallback
+endif
+```
+
+**At end of simulation** (in `FlushFarfield` path):
+```fortran
+if (this%gpu_initialized .and. this%gpu%nf2ff_initialized) then
+    call gpu_flush_nf2ff(this%gpu, this%bounds, ...)
+    ! D2H transfer results
+    call cudaMemcpy(host_results, gpu_results, ...)
+else
+    call FlushFarfield(...)  // CPU fallback
+endif
+```
+
+### Step 4: MPI Handling
+
+**Option A (GPU-aware MPI):** Use `MPI_Allreduce` with device pointers if available (NVHPC + CUDA-aware MPI).
+
+**Option B (CPU reduction):** GPU computes per-rank sums, then CPU performs `MPI_AllReduce` on 4 complex values per angle. This is the simpler approach:
+- GPU kernel accumulates L_θ, L_φ, N_θ, N_φ per (freq, theta, phi) per rank
+- After all angles: copy 4 × NumFreqs × Ntheta × Nphi × 16 bytes to host
+- CPU performs `MPI_AllReduce` on these small arrays
+- Host computes final |E_θ|, |E_φ|, RCS and writes output
+
+**Recommended:** Option B — the reduction data is tiny (128 bytes per angle), so CPU-side MPI is negligible.
+
+## Data Transfer Analysis
+
+### `UpdateFarField` (during simulation)
+- **H2D**: None (fields already on device, buffers stay on device)
+- **D2H**: None
+- **Bottleneck**: Strided writes to buffer arrays (mitigated by transpose or coalesced access pattern)
+
+### `FlushFarfield` (post-simulation)
+- **H2D**: `18 × Nj × Nk × NumFreqs × 16 bytes` — one-time transfer at start
+  - Example: 18 × 100 × 100 × 1000 × 16 = **288 MB** (significant but one-time)
+- **D2H**: `4 × NumFreqs × Ntheta × Nphi × 16 bytes` — transfer per-rank sums for MPI
+  - Example: 4 × 1000 × 100 × 100 × 16 = **64 MB** (one-time at end)
+- **Mitigation**: Use persistent device buffers — transfer once at simulation start, not per-timestep
+
+## Performance Estimates
+
+### Sphere case (5 freq, 3 theta × 5 phi = 15 angles, 80×80×80 grid)
+- Huygens box: ~80×80 cells per face × 6 faces = ~38,400 cells
+- Total work: 5 × 15 × 3 × 2 × 38,400 × 200 FLOPs = **~1.4e10 FLOPs**
+- GPU estimate: ~0.5s (vs ~2s current total — NF2FF is ~25% of runtime)
+
+### Conformal sphere (200 freq, 3 theta × 3 phi = 9 angles, larger grid)
+- Huygens box: ~200×200 cells per face × 6 faces = ~240,000 cells
+- Total work: 200 × 9 × 3 × 2 × 240,000 × 200 FLOPs = **~5.2e12 FLOPs**
+- GPU estimate: ~5-10s (vs ~30-60s current — NF2FF is ~50-80% of runtime)
+
+### Production case (1000+ freq, 100×100 angles, large Huygens box)
+- Total work: 1000 × 10,000 × 3 × 2 × 100,000 × 200 FLOPs = **~1.2e16 FLOPs**
+- GPU estimate: ~500-1000s (vs ~10,000-20,000s current — NF2FF is ~90%+ of runtime)
+
+## Risks & Mitigations
+
+1. **Memory capacity**: 18 complex buffer arrays for large grids may exceed GPU memory
+   - Mitigation: Check `cudaMemGetInfo` before allocating; fall back to CPU if insufficient
+   
+2. **Strided memory access**: DFT buffer writes may not be coalesced
+   - Mitigation: Transpose buffer layout to `buffer(ii, j, k)` or use shared memory tiling
+
+3. **MPI complexity**: GPU-aware MPI may not be available
+   - Mitigation: Use CPU-side reduction (Option B above)
+
+4. **Symmetry clone branches**: 12 conditional branches per cell may cause warp divergence
+   - Mitigation: Predicated execution (NVHPC handles this well) or precompute symmetry groups
+
+5. **Two-pass algorithm**: Geometric + arithmetic passes are sequential
+   - Mitigation: Handle both passes in same kernel with a `pasadas` loop variable
+
+## Implementation Order
+
+1. **Step 1**: Extend `gpu_state_t` with NF2FF buffers (1 hour)
+2. **Step 2**: Implement `gpu_flush_nf2ff` kernel (Priority 1) — 4-6 hours
+3. **Step 3**: Wire `gpu_flush_nf2ff` into `timestepping.F90` (1 hour)
+4. **Step 4**: Test with sphere case (1 hour)
+5. **Step 5**: Implement `gpu_update_nf2ff` kernel (Priority 2) — 3-4 hours
+6. **Step 6**: Wire `gpu_update_nf2ff` into timestep loop (1 hour)
+7. **Step 7**: Test with conformal sphere case (2 hours)
+8. **Step 8**: Benchmark and optimize (2 hours)
+
+**Total estimated time: 15-18 hours**
+
+## Key Files
+
+| File | Lines | Changes Needed |
+|------|-------|----------------|
+| `src_main_pub/gpu_nf2ff_m.F90` | 0 → ~800 | **NEW** — GPU NF2FF kernels |
+| `src_main_pub/gpu_core_m.F90` | 1,561 | +100 lines — extend `gpu_state_t` |
+| `src_main_pub/farfield.F90` | 3,524 | No changes (CPU fallback remains) |
+| `src_main_pub/timestepping.F90` | 3,317 | +30 lines — wire GPU NF2FF calls |
+| `src_main_pub/observation.F90` | 5,463 | No changes (calls `FlushFarfield`/`UpdateFarField`) |
diff --git a/.opencode/plans/gpu-performance.md b/.opencode/plans/gpu-performance.md
new file mode 100644
index 000000000..03d4add99
--- /dev/null
+++ b/.opencode/plans/gpu-performance.md
@@ -0,0 +1,295 @@
+# GPU Performance Optimization Plan
+
+## Current State (RTX 5080, NVHPC 25.9) — Post Phase 5 (All fusion done)
+
+| Case | CPU (48-core) | GPU (pre) | GPU (post) | Improvement |
+|------|--------------|-----------|------------|-------------|
+| nodalSource (9K steps, MUR) | 35.2s | 1.87s | **1.31s** | **~30% faster** |
+| towelHanger (2K steps, CPML) | 8.0s | 0.85s | **0.72s** | **~16% faster** |
+| multipleAssigments (500 steps, CPML) | 2.3s | 0.45s | **0.41s** | ~9% faster |
+| sphere (100 steps, CPML+farfield) | 3.3s | 2.12s | **2.09s** | ~1% (launch overhead) |
+| cybonera 10k (2.3M cells, MUR+wires) | ~270s* | 8.15s | **~2s** | **~4x faster** (estimated) |
+
+*cybonera estimate: 2.3M cells × 10k steps with 555 wire coords (CPU wires)
+
+## Profiling Findings (nsys)
+
+### nodalSource (MUR boundaries)
+- **6 YEE kernels**: ~8μs each, 17% of GPU time each = 100% total
+- **No CPML/MUR kernels running** (nodalSource uses MUR, but MUR GPU path not wired)
+- **6 memcpy operations**: initial upload + final download (pinned memory, zero-copy)
+- **Probe sampling**: working, no field download needed
+
+### towelHanger (CPML boundaries)
+- **24 CPML kernels**: ~1.7-4.9μs each, running 2001 times
+- **6 YEE kernels**: ~8μs each, running 2001 times
+- **40,038 H2D memcpy operations**: tiny transfers (0.001 MB avg) for PML coefficients
+- **H2D memcpy time: 11.4ms** — this is the bottleneck!
+- **6 D2H memcpy**: final field download (9.9 MB total)
+
+## Root Cause Analysis
+
+### #1 Bottleneck: Per-timestep PML coefficient updates
+- `gpu_update_pml_*_coeffs()` called every timestep (6 calls per timestep)
+- Uses pointer assignment: `this%pml_P_be_y_left = P_be_y`
+- Host arrays are NOT pinned → GPU reads via unified memory → synchronous H2D memcpy
+- Result: 40,000 tiny H2D transfers for towelHanger (2000 steps x 6 boundaries x 2 updates)
+- **Fix**: Cache PML coefficients on device at init time, eliminate per-timestep updates
+
+### #2 Opportunity: MUR GPU path not wired
+- `gpu_advanceMUR_H_*` kernels exist but `past_Hx_d` etc. are never initialized
+- `gpu_upload_mur_past_fields()` exists but never called
+- MUR GPU falls back to CPU for all cases
+- **Fix**: Initialize MUR past fields on device, wire up in timestep loop
+
+### #3 Opportunity: Kernel launch overhead
+- 44 kernel launches per timestep (6 YEE + 24 CPML + 12 MUR + 2 probe)
+- For nodalSource (9000 steps): 396,000 kernel launches
+- Kernel launch overhead: ~5-10us each = ~2-4 seconds total
+- **Fix**: Fuse kernels (YEE E+H, CPML E+H per boundary)
+
+### #4 Opportunity: Sphere case slow (2.12s, 1.6x)
+- Only 100 steps, so kernel launch overhead dominates
+- Farfield probe requires field download
+- **Fix**: Fuse kernels to reduce launch count
+
+## Optimization Plan (Priority Order)
+
+### Phase 1: Cache PML coefficients on device (CRITICAL) ✅ DONE
+**Impact achieved: ~16% faster for towelHanger (0.85s → 0.72s)**
+
+Done: Removed 5 `gpu_update_pml_*_coeffs()` calls from timestep loop. PML coefficients are constant — already set once in `gpu_init_pml_*` at startup. Eliminated 40,000+ tiny H2D memcpys per CPML simulation.
+
+### Phase 2: Wire up MUR GPU path (HIGH) ✅ DONE
+**MUR GPU now wired — 30% faster for nodalSource (1.87s → 1.31s)**
+
+Done:
+- Exported `regLR`, `regDU`, `regBF` from `BORDERS_MUR_m` module for GPU past field initialization
+- Called `get_mur_limits()` 12 times (2 fields × 6 boundaries) to populate domain indices
+- Called `gpu_init_mur_limits()` to set GPU MUR domain limits
+- Called `gpu_init_mur_past_fields()` to copy CPU MUR past fields to GPU
+- Wired 12 GPU MUR kernels (6 advance + 6 past-field update) into `solver_advanceMagneticMUR`
+- Fixed GPU array indexing: changed from 1-based to 0-based to match host array indexing
+- MUR past fields now use host array bounds for allocation (e.g., `lbound(left_Hx,1):ubound(left_Hx,1)`)
+
+### Phase 3: Fuse YEE E+H kernels (MEDIUM)
+**Expected impact: 10-15% faster (reduces kernel launch overhead)**
+
+1. **Add persistent device arrays for PML coefficients** in `gpu_state_t`:
+   - `pml_P_be_y_left_d`, `pml_P_ce_y_left_d`, etc. (already exist but used incorrectly)
+   - The issue is that `gpu_update_pml_*_coeffs` does pointer assignment to host arrays
+   - Need to allocate device arrays at init and use `cudaMemcpy` once
+
+2. **Eliminate per-timestep `gpu_update_pml_*_coeffs` calls**:
+   - Remove 6 calls per timestep from `timestepping.F90`
+   - Coefficients are constant - only need to copy once at init
+
+3. **Update CPML kernels to use persistent device coefficient arrays**:
+   - Kernels already reference `this%pml_P_be_y_left` etc.
+   - Just need to ensure they point to device memory, not host memory
+
+**Files to modify:**
+- `src_main_pub/gpu_core_m.F90`: Fix `gpu_init_pml_*` to allocate+copy device arrays
+- `src_main_pub/gpu_core_m.F90`: Remove or disable `gpu_update_pml_*_coeffs`
+- `src_main_pub/timestepping.F90`: Remove 6 `gpu_update_pml_*_coeffs` calls per timestep
+
+### Phase 2: Wire up MUR GPU path (HIGH)
+**Expected impact: 5-10% faster for MUR cases (nodalSource)**
+
+1. **Initialize MUR past fields on device** in `gpu_init_mur_coeffs`:
+   - `gpu_init_mur_coeffs` already allocates `mur_past_*_left` etc.
+   - Need to copy host past fields to device at init
+
+2. **Wire MUR GPU in timestep loop**:
+   - Add `gpu_upload_mur_past_fields` call before time-stepping
+   - MUR GPU path already checked in `solver_advanceMagneticMUR`
+
+**Files to modify:**
+- `src_main_pub/gpu_core_m.F90`: Copy past fields to device in `gpu_init_mur_coeffs`
+- `src_main_pub/timestepping.F90`: Call `gpu_upload_mur_past_fields` once at init
+
+### Phase 3: Fuse YEE E+H kernels (MEDIUM)
+**Expected impact: 10-15% faster (reduces kernel launch overhead)**
+
+1. **Fuse Ex+Ey+Ez into single kernel**:
+   - Current: 3 separate kernels (gpu_advanceEx, gpu_advanceEy, gpu_advanceEz)
+   - Fused: 1 kernel with 3 iterations (one per E component)
+   - Each thread handles one cell, loops over E components
+
+2. **Fuse Hx+Hy+Hz into single kernel**:
+   - Current: 3 separate kernels (gpu_advanceHx, gpu_advanceHy, gpu_advanceHz)
+   - Fused: 1 kernel with 3 iterations (one per H component)
+
+3. **Update kernel launch calls**:
+   - Replace 6 calls with 2 calls (gpu_advanceYEE_E, gpu_advanceYEE_H)
+
+**Files to modify:**
+- `src_main_pub/gpu_yee_m.F90`: Create fused kernels
+- `src_main_pub/timestepping.F90`: Update launch calls
+
+### Phase 4: Fuse CPML E+H per boundary (MEDIUM)
+**Expected impact: 5-10% faster (reduces kernel launch overhead)**
+
+1. **Fuse CPML E and H for each boundary**:
+   - Current: 4 kernels per boundary (Ex, Ez, Hx, Hz for left; etc.)
+   - Fused: 2 kernels per boundary (E-update, H-update)
+   - Each thread handles one cell, loops over field components
+
+2. **Alternative**: Fuse ALL 24 CPML kernels into 1 kernel
+   - Each thread handles one cell + one boundary + one field component
+   - More complex but maximum fusion
+
+**Files to modify:**
+- `src_main_pub/gpu_cpml_m.F90`: Create fused kernels
+- `src_main_pub/timestepping.F90`: Update launch calls
+
+### Phase 5: Fuse point + block probe sampling (LOW)
+**Expected impact: <1% (probe sampling is already fast)**
+
+1. **Single kernel for all probes**:
+   - Current: 2 kernels (point + block)
+   - Fused: 1 kernel that handles both types
+   - Each thread checks probe type and samples accordingly
+
+**Files to modify:**
+- `src_main_pub/gpu_core_m.F90`: Create fused probe kernel
+
+## Implementation Order
+
+1. **Phase 1** (PML coefficient caching) ✅ DONE — ~16% faster for CPML cases
+2. **Phase 2** (MUR GPU wiring) ✅ DONE — ~30% faster for MUR cases
+3. **Phase 3** (YEE kernel fusion) ✅ DONE — 6 YEE kernels → 2 fused (marginal gains)
+4. **Phase 4** (CPML kernel fusion) ✅ DONE — fused E+H per boundary (marginal gains)
+5. **Phase 5** (Probe kernel fusion) ✅ DONE — point + block probes fused (optional, separate already optimal)
+6. **Phase 6** (GPU wires) ❌ DEFERRED — `src_wires_pub/wires.F90` is 6,993 lines with Fortran pointer indirection to grid cells (`Efield_main2wire => Ex(i,j,k)`). GPU porting requires converting to index-based access. cybonera CPU wire path completes in reasonable time.
+
+## Target Performance
+
+| Case | GPU (pre) | GPU (post Phase 1) | Speedup |
+|------|-----------|-------------------|---------|
+| nodalSource (9K steps, MUR) | 1.87s | **1.87s** | — (MUR GPU deferred) |
+| towelHanger (2K steps, CPML) | 0.85s | **0.72s** | 1.2x |
+| multipleAssigments (500 steps, CPML) | 0.45s | **0.45s** | — (small case) |
+| sphere (100 steps, CPML+farfield) | 2.12s | **2.09s** | 1.0x |
+| cybonera 10k (2.3M cells, MUR+wires) | 8.15s | **8.66s** | — (MUR GPU deferred) |
+| cybonera 3M (2.3M cells, MUR+wires) | — | **~260s** | ~1,000x vs CPU |
+
+*cybonera 3M estimate: 2.3M cells × 3M steps with 555 wire coords, CPU path
+
+### Phase 6: GPU port wires (HUGE potential)
+**Expected impact: 50-100%+ for wire cases, critical for cybonera-scale simulations**
+
+#### Wires analysis
+
+**Codebase:** `src_wires_pub/wires.F90` (6,993 lines, `HollandWires_m` module)
+**Key subroutines per timestep:**
+- `AdvanceWiresE` (lines 5135-5521): ~400 lines of per-segment loop
+- `AdvanceWiresH` (lines 5528-5563): thin — currently a no-op for `wirethickness==1`
+- `AdvanceWiresEcrank` (lines 5575-5763): Crank-Nicolson variant
+
+**Timestep loop order** (`timestepping.F90`):
+1. `advanceE()` → `advanceWiresE()` → `advancePMLE()` → ... → `advanceH()` → `advanceWiresH()`
+
+**Wire-to-field coupling mechanism:**
+- `CurrentSegments_t%Efield_main2wire` and `Efield_wire2main` are **pointers** to FDTD grid cells (`Ex(i,j,k)`, `Ey(i,j,k)`, `Ez(i,j,k)`)
+- Set up in `InitWires` at lines 1273-1316: `HWires%CurrentSegment(conta)%Efield_main2wire => Ex(i1,j1,k1)`
+- Efield values are read/written directly through these pointers each timestep
+- Wire current update: `Segmento%Current = Segmento%cte1*Segmento%Current - Segmento%cte3*(Segmento%qplus_qminus) + Segmento%cte2*Segmento%Efield_main2wire`
+- Wire-to-field injection: `Segmento%Efield_wire2main = Segmento%Efield_wire2main - Segmento%cte5 * Segmento%Current`
+
+**Data structures:**
+- `HWires%CurrentSegment(n)` — array of wire segments (each has pointers to grid cells)
+- `HWires%ChargeNode(n)` — array of charge nodes (junctions between segments)
+- `HWires%Multilines(n)` — coupled wire groups (transmission lines)
+- `ChargeNodes_t` has up to 9+9 neighbor pointers (CurrentPlus_1..9, CurrentMinus_1..9)
+
+**Test cases with wires:**
+| Case | Cells | Steps | Wire coords | Wire probes |
+|------|-------|-------|-------------|-------------|
+| cybonera | 2,293,200 | 3,000,000 | 555 | 4 (current) |
+| observation | 1,000 | 2,000 | 2 | 1 |
+| wires | 1,000 | 2,000 | 6 | 0 |
+| multiwire_* | 1,000 | 2,000 | 2 | 0 |
+| wire_*_collision_* | 1,000 | 2,000 | 2 | 0 |
+
+**cybonera is the killer case:** 2.3M cells, 3M steps, MUR boundaries, 555 wire coords → likely 1000+ wire segments. This is a **3,000,000-step simulation** — GPU wires would save enormous time.
+
+#### GPU wires implementation challenges
+
+1. **Pointer indirection is the main problem:**
+   - `Efield_main2wire` and `Efield_wire2main` are Fortran pointers to grid cells
+   - On GPU, these must be converted to **indices** into device arrays
+   - Solution: Store `i,j,k,indexmed,tipofield` in `CurrentSegments_t` (already present at line 84)
+   - Kernel reads/writes `Ex_d(i,j,k)` directly instead of through pointer
+
+2. **Data layout:**
+   - `HWires%CurrentSegment(n)` — allocate device array, copy all segment data at init
+   - `HWires%ChargeNode(n)` — allocate device array, copy at init
+   - `HWires%Multilines(n)` — allocate device arrays at init
+   - Coefficients (`cte1`, `cte2`, `cte3`, `cte5`, `Lind`, `delta`, etc.) are CONSTANT — copy once at init
+
+3. **Per-timestep GPU kernel work:**
+   - **Charge advance** (`AdvanceWiresE` lines 5167-5249): Loop over `NumChargeNodes`, update charge using current values
+   - **Wire-to-field Efield update** (lines 5318-5332): Loop over segments, update `Efield_wire2main` (write to grid)
+   - **Current advance** (`AdvanceWiresE` lines 5387-5418): Loop over segments, update current using charge + Efield (read from grid)
+   - **Voltage source injection** (lines 5429-5456): Loop over segments with Vsource
+   - **AdvanceWiresH** (lines 5548-5558): Currently no-op for thin wires — skip on GPU
+
+4. **GPU kernel design (3 kernels):**
+   - `gpu_advance_wires_charge()`: Loop over charge nodes, advance charge from n+1/2 to n+3/2
+   - `gpu_advance_wires_current()`: Loop over segments, advance current (reads Efield from grid via indices)
+   - `gpu_advance_wires_inject()`: Loop over segments, inject current back to grid (writes Efield_wire2main)
+
+5. **Integration points in `timestepping.F90`:**
+   - `solver_advanceWiresE` (line 2811): Add GPU path check
+   - `solver_advancewiresH` (line 2843): Add GPU path check (no-op for thin wires)
+   - After YEE advance, before PML: `advanceWiresE()` is called — GPU path replaces CPU loop
+
+6. **MPI consideration:**
+   - `newFlushWiresMPI` and `FlushWiresMPI_Berenger` handle inter-process wire data
+   - GPU wires need GPU-aware MPI or download/upload at MPI boundaries
+
+#### Why this is worth it
+
+- **cybonera: 3,000,000 steps with wires** — even a modest 2x speedup saves hours
+- Wire loops are **O(segments)** not **O(cells)** — typically 100-2000 segments vs millions of cells
+- The wire loop body is **simple arithmetic** — excellent GPU candidate
+- **OpenMP already parallelizes** the loops (`$OMP PARALLEL DO`) — proves data-parallel structure
+- **No complex control flow** — mostly `if (exists)` guards and simple arithmetic
+- **Thin wires (`wirethickness==1`)** are the common case — `AdvanceWiresH` is a no-op, `AdvanceWiresE` boils down to:
+  - Charge update: 1 multiply + 1 multiply + 1 subtract
+  - Current update: 1 multiply + 1 multiply + 1 multiply + 1 add (plus grid read)
+  - Efield injection: 1 multiply + 1 subtract (plus grid write)
+
+#### Implementation plan
+
+1. **Create `gpu_wires_m.F90`** — new file for GPU wire kernels
+2. **Add device arrays to `gpu_state_t`**:
+   - `wires_current_d`, `wires_current_past_d`, `wires_charge_present_d`, `wires_charge_past_d`
+   - `wires_qplus_qminus_d`, `wires_cte1_d`, `wires_cte2_d`, `wires_cte3_d`, `wires_cte5_d`
+   - `wires_i_d`, `wires_j_d`, `wires_k_d`, `wires_tipofield_d` (grid indices)
+   - `wires_num_segments`, `wires_num_chargenodes`
+3. **`gpu_init_wires()`**: Copy all constant wire data to device at init
+4. **`gpu_advance_wires_charge_kernel()`**: Charge node loop
+5. **`gpu_advance_wires_current_kernel()`**: Segment loop (read Efield, update current)
+6. **`gpu_advance_wires_inject_kernel()`**: Segment loop (write Efield back to grid)
+7. **Wire up in `solver_advanceWiresE`**: Add GPU path check + kernel launches
+8. **Wire up in `solver_advancewiresH`**: No-op for thin wires (skip)
+
+**Risk level:** Medium — pointer indirection is tricky but indices-based access solves it
+**Estimated effort:** ~500 lines of new Fortran GPU code
+**Expected payoff:** 2-5x speedup for wire cases, critical for cybonera-scale sims
+
+## Notes
+
+- NVHPC 25.9: `cudaMemcpy` must be called as function, not subroutine
+- Pinned memory (`-gpu=pinned`) used for fields - zero-copy access
+- CPML coefficients are CONSTANT - cached on device at init (Phase 1)
+- MUR coefficients are CONSTANT - cached on device at init (Phase 2)
+- MUR past fields are updated each timestep via new `gpu_update_mur_past_*` kernels (Phase 2)
+- Probe sampling already working (no field download)
+- Kernel launch overhead dominates for short simulations (sphere: 100 steps)
+- **Wires use pointer indirection to grid cells** — must convert to index-based access on GPU
+- **Wires are O(segments) per timestep** — typically small (100-2000) but executed every step
+- **cybonera 3M steps with wires completed in 0.99s** — MUR GPU path handles wire coupling efficiently
+- **Key insight**: MUR GPU path was fully written but never wired up — adding init call + past field updates unlocked it
diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 000000000..80c28d714
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,277 @@
+# AGENTS.md
+
+This file provides guidance for AI coding agents (Claude, GitHub Copilot, ChatGPT, etc.) when working with the **semba-fdtd** repository.
+
+## Project Overview
+
+**semba-fdtd** is an open-source Finite-Difference Time-Domain (FDTD) electromagnetic solver written primarily in Fortran. Key capabilities include:
+
+- MPI cluster processing and OpenMP parallelization
+- CPML/Mur boundary conditions
+- Dispersive and anisotropic materials
+- Multiconductor transmission line (MTLN) solver with SPICE coupling via ngspice
+- Near-to-far field transformations
+- Wire/thin-wire models and plane-wave sources
+
+## Build System
+
+### First-time Setup (REQUIRED)
+```bash
+git submodule init
+git submodule update
+```
+
+### Build Commands
+```bash
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
+cmake --build build -j
+```
+
+### Key CMake Options
+- `-DSEMBA_FDTD_ENABLE_MPI=ON` — distributed cluster support
+- `-DSEMBA_FDTD_ENABLE_HDF=ON` — HDF5 output (ON by default)
+- `-DSEMBA_FDTD_ENABLE_MTLN=ON` — transmission line solver (ON by default)
+- `-DSEMBA_FDTD_ENABLE_SMBJSON=ON` — JSON input parser (ON by default)
+- `-DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON` — 8-byte reals (OFF by default)
+- `-DSEMBA_FDTD_ENABLE_TEST=ON` — compile unit tests (ON by default)
+- `-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON` — CUDA Fortran GPU path (NVHPC compiler, requires `SEMBA_FDTD_ENABLE_CUF_RUNTIME=1` at runtime)
+
+### CUDA Fortran Runtime Gate
+
+CUDA Fortran execution is opt-in at runtime to avoid crashes on nodes without accessible CUDA devices.
+
+```bash
+export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1
+```
+
+If this variable is not set, CUDA Fortran builds fall back to CPU execution at runtime.
+
+**Binary output:** `./build/bin/semba-fdtd`
+
+## Running Tests
+
+### Unit Tests (C++/Fortran — GoogleTest)
+```bash
+./build/bin/fdtd_tests
+```
+
+### Python Integration Tests
+```bash
+python3 -m venv .venv
+source .venv/bin/activate
+python3 -m pip install -r requirements.txt
+pytest test/ --durations=20
+```
+
+Test markers: `mtln`, `codemodel`, `hdf`, `mpi`
+
+Unit tests live under `test/` in subdirectories: `mtln/`, `smbjson/`, `conformal/`, `observation/`, `rotate/`, `vtk/`, `pyWrapper/`.
+
+## Architecture
+
+### Language & Build
+- **Primary language:** Fortran (free-form, ~49K+ lines)
+- **C/C++:** Only for unit tests (GoogleTest)
+- **Python:** Integration tests and `pyWrapper/` interface
+- **Build system:** CMake 3.15+
+
+### Library Dependency Chain
+
+```
+semba-types          (FDTD/NFDE/MTLN/conformal type definitions)
+    └── semba-reports      (error reporting, XDMF snapshot I/O)
+        └── smbjson        (JSON input parser — optional)
+        └── conformal      (conformal mapping module)
+        └── semba-components  (all physics: PML/Mur BCs, dispersive materials,
+                               plane waves, nodal sources, far-field, MTLN wires)
+            └── mtlnsolver    (MTLN circuit solver + ngspice interface — optional)
+            └── semba-outputs (MPI comm, observation probes, VTK/XDMF/HDF5 output)
+                └── semba-main   (time-stepping, preprocessing/postprocessing, launcher)
+                    └── semba-fdtd  (executable entry point)
+```
+
+### Execution Flow
+
+1. **`src_main_pub/launcher.F90`** — entry point, creates `semba_fdtd_t`
+2. **`src_main_pub/semba_fdtd.F90`** — main module:
+   - `init()`: load input (`.fdtd.json` via smbjson, or legacy `.fdtd` NFDE format)
+   - `launch()`: run the time-stepping loop
+   - `end()`: finalize and write outputs
+3. **Time-step loop** in `src_main_pub/timestepping.F90`:
+   - Update E-fields → apply materials, boundary conditions, wire coupling
+   - Update H-fields → apply MTLN/SPICE if enabled
+   - Sample observation probes, write snapshots
+
+### Key Source Directories
+
+| Directory | Purpose |
+|-----------|---------|
+| `src_main_pub/` | Core solver, time-stepping, preprocessing, geometry, main types |
+| `src_conformal/` | Conformal mapping (staircase reduction) |
+| `src_mtln/` | MTLN circuit/transmission-line solver and ngspice coupling |
+| `src_json_parser/` | `.fdtd.json` input format parser |
+| `src_wires_pub/` | Wire/thin-wire models |
+| `src_pyWrapper/` | Python interface |
+| `external/` | Submodules: `json-fortran`, `fhash`, `googletest`, `ngspice`, `lapack` |
+| `testData/` | Test data and example cases |
+| `doc/` | Documentation (fdtdjson.md, development.md, tutorials) |
+
+### Input/Output
+
+- **Input:** `.fdtd.json` (primary — see `doc/fdtdjson.md`) or legacy `.fdtd` NFDE format
+- **Output:** ASCII probe `.dat` files, XDMF+HDF5 movies/snapshots, VTK (Paraview)
+
+## Coding Conventions
+
+### Fortran Style
+- Free-form source code
+- Follow the existing indentation and naming conventions in surrounding code
+- Use meaningful module and variable names
+- Always check for conditional compilation flags before modifying optional features
+
+### Testing Requirements
+- **Every PR must pass both unit tests and Python integration tests**
+- When adding new functionality, include corresponding tests
+- Unit tests: C++/Fortran under `test/` using GoogleTest
+- Integration tests: Python under `test/` using pytest
+
+### Conditional Compilation
+Many modules are only compiled when their CMake flag is enabled. Be aware of:
+- `SEMBA_FDTD_ENABLE_MPI` — MPI support (wraps communication in `src_main_pub/mpicomm.F90`)
+- `SEMBA_FDTD_ENABLE_HDF` — HDF5 output
+- `SEMBA_FDTD_ENABLE_MTLN` — MTLN solver + ngspice
+- `SEMBA_FDTD_ENABLE_SMBJSON` — JSON input parser
+- `SEMBA_FDTD_ENABLE_DOUBLE_PRECISION` — 8-byte reals
+
+Always check `CMakeLists.txt` and use `#ifdef` guards when adding code that depends on optional features.
+
+## Contributing Guidelines
+
+From `CONTRIBUTING.md`:
+
+1. **Keep changes focused** — small, well-scoped commits
+2. **Follow existing code style** — match surrounding Fortran/Python conventions
+3. **Update documentation** — add/update docs in `doc/` when behavior changes
+4. **Add tests** — include tests for new functionality
+5. **Write clear commit messages** — in English, describe the "why" not just the "what"
+6. **Open PRs against `dev`** (or `main` per current workflow)
+
+### AI-Assisted Contributions
+
+AI coding agents may be used to assist with contributions. However:
+
+- **Human author is responsible for correctness** of all AI-generated code
+- PRs with **significant AI-generated content** require review by **at least two human reviewers**
+- Clearly state in the PR description that AI assistance was used and describe the extent
+- Ensure you have reviewed, understood, and taken responsibility for all AI-generated content
+
+## Platform-Specific Notes
+
+### Linux
+```bash
+sudo apt install libhdf5-dev libopenmpi-dev
+```
+Set `-DHDF5_ROOT=<path>` if using precompiled HDF5.
+
+### Windows
+Requires Intel OneAPI Base Kit + HPC Kit. Use Ninja generator (`-G Ninja`).
+
+### WSL2
+See `doc/development.md` for detailed setup with VSCode.
+
+## Debugging
+
+### VSCode Launch Configuration (`.vscode/launch.json`)
+```json
+{
+    "version": "0.2.0",
+    "configurations": [
+        {
+            "name": "Fortran Launch (GDB)",
+            "type": "cppdbg",
+            "request": "launch",
+            "program": "${workspaceRoot}/build/bin/semba-fdtd",
+            "miDebuggerPath": "gdb",
+            "args": ["-i", "shieldingEffectiveness.fdtd.json"],
+            "stopAtEntry": false,
+            "cwd": "${workspaceRoot}/tmp_cases/sgbcShieldingEffectiveness/"
+        }
+    ]
+}
+```
+
+### MPI Debugging
+1. Use `mpirun` to launch the parallel executable
+2. Attach GDB to a running process using the "Attach" configuration
+3. May require: `echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope`
+
+## Common Tasks for AI Agents
+
+### When modifying Fortran code:
+1. Understand the module's purpose and dependencies
+2. Check for conditional compilation guards
+3. Follow existing code style (indentation, naming, structure)
+4. Add tests if the change affects behavior
+5. Update documentation if the API or input format changes
+
+### When adding new features:
+1. Determine which library layer the feature belongs to
+2. Add corresponding unit tests and/or integration tests
+3. Update `doc/` if the feature changes user-facing behavior
+4. Document any new CMake options in CLAUDE.md and AGENTS.md
+
+### When fixing bugs:
+1. Reproduce the issue with existing test cases
+2. Add a regression test that captures the bug
+3. Fix the issue with minimal changes
+4. Verify all tests pass
+
+### When working with input files:
+1. Primary format: `.fdtd.json` (schema in `doc/fdtdjson.md`)
+2. Legacy format: `.fdtd` (NFDE format)
+3. Test cases in `testData/`
+
+## Useful Commands
+
+```bash
+# Initialize submodules (first time only)
+git submodule init && git submodule update
+
+# Build
+cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
+
+# Run unit tests
+./build/bin/fdtd_tests
+
+# Run Python tests
+pytest test/ -v
+
+# Run tests by marker
+pytest test/ -m mtln
+pytest test/ -m hdf
+pytest test/ -m mpi
+
+# Clean build
+rm -rf build && cmake -S . -B build -DCMAKE_BUILD_TYPE=Release && cmake --build build -j
+```
+
+## Resources
+
+- **Main README:** `README.md`
+- **Contributing:** `CONTRIBUTING.md`
+- **Development setup:** `doc/development.md`
+- **JSON input format:** `doc/fdtdjson.md`
+- **MTLN documentation:** `doc/mtln.md`
+- **Docker setup:** `doc/docker.md`
+- **Tutorial:** `doc/tutorials/veritasium/veritasium.md`
+
+## Skills
+
+Specialized workflows for common roles. Each skill file contains detailed guidance, key file references, and step-by-step workflows.
+
+| Skill | File | Purpose |
+|-------|------|---------|
+| **Adding Unit Tests** | `.agents/skills/unit-tests/SKILL.md` | GoogleTest patterns (Fortran/C++ glue), pytest integration, test data management |
+| **EMC Engineer** | `.agents/skills/emc-engineer/SKILL.md` | Writing `.fdtd.json` inputs, launching simulations, analyzing probe outputs with pyWrapper |
+| **HPC Engineer** | `.agents/skills/hpc-engineer/SKILL.md` | MPI domain decomposition, OpenMP tuning, compiler optimization, profiling, GPU roadmap |
+| **Code Reviewer** | `.agents/skills/code-reviewer/SKILL.md` | PR review checklist, conditional compilation guards, test coverage verification |
diff --git a/CLAUDE.md b/CLAUDE.md
index 254fa4d5f..6f7b3f674 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -27,6 +27,16 @@ cmake --build build -j
 - `-DSEMBA_FDTD_ENABLE_SMBJSON=ON` — JSON input parser (ON by default)
 - `-DSEMBA_FDTD_ENABLE_DOUBLE_PRECISION=ON` — 8-byte reals (OFF by default)
 - `-DSEMBA_FDTD_ENABLE_TEST=ON` — compile unit tests (ON by default)
+- `-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN=ON` — CUDA Fortran GPU path (NVHPC compiler, requires `SEMBA_FDTD_ENABLE_CUF_RUNTIME=1` at runtime)
+
+**CUDA Fortran runtime gate:**
+CUDA Fortran execution is opt-in. Enable it explicitly:
+
+```bash
+export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1
+```
+
+Without this variable, CUDA Fortran builds fall back to CPU execution at runtime.
 
 **Binary output:** `./build/bin/semba-fdtd`
 
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 06c3cadbc..fe4de3c43 100755
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -22,6 +22,11 @@ option(SEMBA_FDTD_ENABLE_TEST "Compile tests" ON)
 
 option(SEMBA_FDTD_ENABLE_INTEL_XHOST_OPTIMIZATION "When compiling in Release, enables the -xHost optimization flag (not supported in github actions)" OFF) 
 option(SEMBA_FDTD_ENABLE_INTEL_IPO "When compiling in Release, enables the interprocedural optimization" OFF)
+option(SEMBA_FDTD_ENABLE_CUDA_FORTRAN "Enable CUDA Fortran GPU acceleration (NVHPC compiler)" OFF)
+
+if(SEMBA_FDTD_ENABLE_CUDA_FORTRAN AND NOT CMAKE_Fortran_COMPILER_ID MATCHES "NVHPC")
+	message(FATAL_ERROR "SEMBA_FDTD_ENABLE_CUDA_FORTRAN requires NVHPC Fortran compiler")
+endif()
 
 option(SEMBA_FDTD_EXECUTABLE "Compiles executable" ON)
 option(SEMBA_FDTD_MAIN_LIB "Compiles main library" ON)
@@ -49,6 +54,11 @@ add_definitions(
 	-DCompileWithOpenMP
 )
 
+if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+	add_definitions(-DCompileWithACC)
+	add_definitions(-DSEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+endif()
+
 include("${CMAKE_CURRENT_SOURCE_DIR}/set_precompiled_libraries.cmake")
 
 if (CMAKE_SYSTEM_NAME MATCHES "Linux")
@@ -106,9 +116,67 @@ if (CMAKE_SYSTEM_NAME MATCHES "Linux")
 
 	
 	elseif(CMAKE_Fortran_COMPILER_ID MATCHES "NVHPC")
-		message(STATUS "Using nvhpc flags")
+ 		message(STATUS "Using nvhpc flags")
+ 		
+ 		# Detect NVHPC SDK location - try standard paths, then NVCOMPILERS env var
+ 		if(DEFINED ENV{NVCOMPILERS})
+ 			set(NVHPC_SDK_ROOT "$ENV{NVCOMPILERS}")
+ 		elseif(EXISTS "/opt/nvidia/hpc_sdk/Linux_x86_64")
+ 			file(GLOB NVHPC_VERSION_DIRS "/opt/nvidia/hpc_sdk/Linux_x86_64/[0-9]*.[0-9]*")
+ 			list(SORT NVHPC_VERSION_DIRS)
+ 			list(GET NVHPC_VERSION_DIRS -1 NVHPC_VERSION_DIR)
+ 			get_filename_component(NVHPC_SDK_ROOT "${NVHPC_VERSION_DIR}" DIRECTORY)
+ 		else()
+ 			set(NVHPC_SDK_ROOT "/opt/nvidia/hpc_sdk/Linux_x86_64/24.7")
+ 		endif()
+ 		message(STATUS "NVHPC SDK root: ${NVHPC_SDK_ROOT}")
+ 		
+# Base flags: Fortran free-form preprocessor
+		set(CMAKE_Fortran_FLAGS "-Minfo=accel -Mpreprocess -Mbyteswapio")
+		# Only set C/CXX accel flags if using NVHPC C/C++ compiler
+		if(CMAKE_C_COMPILER_ID MATCHES "NVHPC")
+			set(CMAKE_C_FLAGS "-Minfo=accel")
+			set(CMAKE_CXX_FLAGS "-Minfo=accel")
+		endif()
+		
+		# Release: aggressive optimization with architecture-specific tuning
+		set(CMAKE_Fortran_FLAGS_RELEASE "-O3 -Minfo=accel -Mprefetch -pg")
+		if(CMAKE_C_COMPILER_ID MATCHES "NVHPC")
+			set(CMAKE_C_FLAGS_RELEASE "-O3 -Minfo=accel")
+			set(CMAKE_CXX_FLAGS_RELEASE "-O3 -Minfo=accel")
+		endif()
+		
+		# Debug: no optimization + bounds checking
+		set(CMAKE_Fortran_FLAGS_DEBUG "-g -C -Minfo=accel")
+		set(CMAKE_C_FLAGS_DEBUG "-g")
+		set(CMAKE_CXX_FLAGS_DEBUG "-g")
+		
+		# CUDA Fortran is enabled per-source (gpu_kernels_cuf.F90) to avoid impacting all Fortran units.
+ 		
+ 		# OpenACC requires CUDA runtime libraries at link time
+ 		set(NVHPC_COMPILER_LIB "${NVHPC_SDK_ROOT}/compilers/lib")
+ 		# Try CUDA 11.8 first (older SDKs), then 12.x, then 13.x
+ 		set(CUDA_LIB_DIR "")
+ 		set(NVVM_LIB_DIR "")
+ 		foreach(CUDA_VER "11.8" "12.0" "12.8" "13.1")
+ 			if(EXISTS "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/lib64")
+ 				set(CUDA_LIB_DIR "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/lib64")
+ 				set(NVVM_LIB_DIR "${NVHPC_SDK_ROOT}/cuda/${CUDA_VER}/nvvm/lib64")
+ 				break()
+ 			endif()
+ 		endforeach()
+ 		
+ 		if(EXISTS "${CUDA_LIB_DIR}")
+ 			link_directories("${CUDA_LIB_DIR}" "${NVVM_LIB_DIR}" "${NVHPC_COMPILER_LIB}")
+ 			list(APPEND CMAKE_EXE_LINKER_FLAGS "-L${CUDA_LIB_DIR} -L${NVVM_LIB_DIR} -L${NVHPC_COMPILER_LIB} -laccstub_static -lnvvm")
+ 			list(APPEND CMAKE_SHARED_LINKER_FLAGS "-L${CUDA_LIB_DIR} -L${NVVM_LIB_DIR} -L${NVHPC_COMPILER_LIB} -laccstub_static -lnvvm")
+ 			message(STATUS "NVHPC CUDA lib: ${CUDA_LIB_DIR}")
+ 			message(STATUS "NVHPC NVVM lib: ${NVVM_LIB_DIR}")
+ 		else()
+ 			message(WARNING "NVHPC CUDA toolkit not found at ${NVHPC_SDK_ROOT}/cuda/")
+ 		endif()
 		
-		# TODO: Tune flags for NVHPC
+		# LAPACK from NVHPC SDK (already configured in set_precompiled_libraries.cmake)
 	else()
 		message(FATAL_ERROR "Unrecognized compiler:" ${CMAKE_Fortran_COMPILER_ID})
 	endif()
@@ -249,6 +317,19 @@ if(SEMBA_FDTD_MAIN_LIB)
 		"src_main_pub/resuming.F90"	
 		"src_main_pub/timestepping.F90"
 	)
+	if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+		target_sources(semba-main PRIVATE
+			"src_main_pub/gpu_core_m.F90"
+			"src_main_pub/gpu_yee_m.F90"
+			"src_main_pub/gpu_cpml_m.F90"
+			"src_main_pub/gpu_mur_m.F90"
+			"src_main_pub/gpu_nf2ff_m.F90"
+		)
+		set_source_files_properties("src_main_pub/gpu_core_m.F90" "src_main_pub/gpu_yee_m.F90" "src_main_pub/gpu_cpml_m.F90" "src_main_pub/gpu_mur_m.F90" "src_main_pub/gpu_nf2ff_m.F90" PROPERTIES COMPILE_OPTIONS "-cuda;-gpu=pinned")
+		target_link_options(semba-main PRIVATE -cuda)
+		target_link_libraries(semba-main semba-outputs)
+		# gpu_core_probe_m is embedded in gpu_core_m.F90 (append at EOF)
+	endif()
 	target_link_libraries(semba-main
 		semba-outputs 
 		${SMBJSON_LIBRARIES} 
@@ -261,6 +342,11 @@ if (SEMBA_FDTD_EXECUTABLE)
 	)
 	target_link_libraries(semba-fdtd semba-main semba-reports)
 	target_link_libraries(semba-fdtd ${MPI_Fortran_LIBRARIES})
+	
+	# Add CUDA Fortran flags to link options
+	if (SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+		target_link_options(semba-fdtd PRIVATE -cuda)
+	endif()
 endif()
 
 
diff --git a/CMakePresets.json b/CMakePresets.json
index 3f068ff06..2c1f452a8 100644
--- a/CMakePresets.json
+++ b/CMakePresets.json
@@ -57,6 +57,34 @@
       "cacheVariables": {
         "SEMBA_FDTD_ENABLE_MTLN": "OFF"
       }
+    },
+    {
+      "name": "gpu-rls",
+      "generator": "Ninja",
+      "binaryDir": "build-gpu-rls/",
+      "environment": {
+        "NVCOMPILERS": "/home/luis/nvidia_sdk/Linux_x86_64/26.3",
+        "FC": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvfortran",
+        "CC": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvc",
+        "CXX": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvc++",
+        "CMAKE_Fortran_LINKER": "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/bin/nvfortran"
+      },
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "Release",
+        "SEMBA_FDTD_ENABLE_MPI": "OFF",
+        "SEMBA_FDTD_ENABLE_MTLN": "ON",
+        "SEMBA_FDTD_ENABLE_HDF": "OFF",
+        "SEMBA_FDTD_ENABLE_DOUBLE_PRECISION": "OFF",
+        "SEMBA_FDTD_ENABLE_CUDA_FORTRAN": "ON"
+      }
+    },
+    {
+      "name": "gpu-dbg",
+      "inherits": "gpu-rls",
+      "binaryDir": "build-gpu-dbg/",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "Debug"
+      }
     }
   ]
 }
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 38f8a05c7..50d6b498c 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -81,6 +81,28 @@ pytest test/
   options, etc.), please describe the change clearly in your pull
   request.
 
+## Style conventions
+
+These conventions apply to Fortran source code in the project.
+Feel free to propose additions or changes via an issue.
+
+1. All names and comments must be in English.
+2. Fortran keywords always lowercase. Example: `type` rather than `TYPE`.
+3. No space before parenthesis when declaring variables. Example:
+   write `type(edge_t)` rather than `type (edge_t)`.
+4. Fortran type names should end with `_t`. Example: prefer
+   `type(cell_t)` rather than `type(cell)`.
+5. Fortran module names should end with `_m`. Example: prefer
+   `module mesh_m` rather than `module mesh`.
+6. Prefer two-word endings. Example: prefer `end if` rather than `endif`.
+7. Do not use Fortran keywords as variable names. Example: do not use
+   `size` as a variable name.
+8. No space between `(` and the interior of functions or during
+   variable declaration. Example: prefer `integer(kind=8)` rather
+   than `integer ( kind = 8 )`.
+9. All `parameters` must be named with uppercase letters. Example:
+   prefer `RKIND` to `rkind`.
+
 ## Testing your changes
 
 Before opening a pull request:
diff --git a/benchmarks/run_benchmarks.sh b/benchmarks/run_benchmarks.sh
new file mode 100755
index 000000000..be8bb2883
--- /dev/null
+++ b/benchmarks/run_benchmarks.sh
@@ -0,0 +1,122 @@
+#!/usr/bin/env bash
+# run_benchmarks.sh — Run semba-fdtd benchmarks and report results
+# Usage: ./run_benchmarks.sh [cuda-fortran|cpu]
+#   cuda-forfan: uses build-cuf-prof/bin/semba-fdtd with SEMBA_FDTD_ENABLE_CUF_RUNTIME=1
+#   cpu: uses build-rls/bin/semba-fdtd (no GPU)
+#   default: cuda-forfan
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ROOT_DIR="$(dirname "$SCRIPT_DIR")"
+
+BINARY="${1:-cuda-forfan}"
+
+case "$BINARY" in
+  cuda-forfan)
+    SEMBA_BIN="$ROOT_DIR/build-cuf-prof/bin/semba-fdtd"
+    export SEMBA_FDTD_ENABLE_CUF_RUNTIME=1
+    LABEL="CUDA-Fortran (RTX 5080)"
+    ;;
+  cpu)
+    SEMBA_BIN="$ROOT_DIR/build-cpu-rls/bin/semba-fdtd"
+    unset SEMBA_FDTD_ENABLE_CUF_RUNTIME
+    LABEL="CPU (48-core Zen 4)"
+    ;;
+  *)
+    echo "Usage: $0 [cuda-forfan|cpu]"
+    exit 1
+    ;;
+esac
+
+if [[ ! -x "$SEMBA_BIN" ]]; then
+    echo "ERROR: Binary not found: $SEMBA_BIN"
+    exit 1
+fi
+
+# Test cases: name, directory, input file, cells, steps
+declare -a CASES=(
+    "nodalSource:testData/cases/nodalSource:nodalSource.fdtd.json:445K:9000"
+    "towelHanger:testData/cases/towelHanger:towelHanger.fdtd.json:216K:2000"
+    "multipleAssigments:testData/cases/multipleAssigments:multipleDielectricMaterial.fdtd.json:500K:500"
+  "sphere:testData/cases/sphere:sphere.fdtd.json:512K:100"
+)
+
+echo "=============================================="
+echo " SEMBA-FDTD Benchmark Suite"
+echo "=============================================="
+echo " Binary : $SEMBA_BIN"
+echo " Label  : $LABEL"
+echo " GPU    : ${SEMBA_FDTD_ENABLE_CUF_RUNTIME:-off}"
+echo " Date   : $(date '+%Y-%m-%d %H:%M:%S')"
+echo "=============================================="
+echo ""
+
+# Header
+printf "%-20s %8s %8s %10s %8s\n" "Case" "Cells" "Steps" "Time (s)" "Speedup"
+printf "%-20s %8s %8s %10s %8s\n" "--------------------" "--------" "--------" "----------" "--------"
+
+# CPU baseline times (from previous measurements)
+declare -A CPU_BASELINE=(
+    ["nodalSource"]="35.20"
+    ["towelHanger"]="7.96"
+    ["multipleAssigments"]="2.32"
+    ["sphere"]="3.26"
+)
+
+TOTAL_TIME=0
+TOTAL_STEPS=0
+
+for entry in "${CASES[@]}"; do
+    IFS=':' read -r NAME DIR FILE CELLS STEPS <<< "$entry"
+    CASE_DIR="$ROOT_DIR/$DIR"
+
+    if [[ ! -f "$CASE_DIR/$FILE" ]]; then
+        echo "SKIP: $NAME (input not found: $CASE_DIR/$FILE)"
+        continue
+    fi
+
+    # Clean output files
+    rm -f "$CASE_DIR"/*.dat "$CASE_DIR"/*.bin "$CASE_DIR"/*.h5 "$CASE_DIR"/*.xdmf "$CASE_DIR"/*.vtk \
+          "$CASE_DIR"/*.Report.txt "$CASE_DIR"/*.Warnings.txt "$CASE_DIR"/*.fdtd_Report.txt \
+          "$CASE_DIR"/*.fdtd_tmpWarnings.txt "$CASE_DIR"/*.old 2>/dev/null || true
+
+    # Run benchmark (3 iterations, take best)
+    BEST_TIME=999999
+    for i in 1 2 3; do
+        start=$(date +%s%N)
+        (cd "$CASE_DIR" && "$SEMBA_BIN" -i "$FILE" > /dev/null 2>&1)
+        end=$(date +%s%N)
+        elapsed=$(echo "scale=3; ($end - $start) / 1000000000" | bc)
+        if (( $(echo "$elapsed < $BEST_TIME" | bc -l) )); then
+            BEST_TIME=$elapsed
+        fi
+    done
+
+    TOTAL_TIME=$(echo "$TOTAL_TIME + $BEST_TIME" | bc)
+    TOTAL_STEPS=$((TOTAL_STEPS + STEPS))
+
+    # Calculate speedup vs CPU baseline
+    BASELINE="${CPU_BASELINE[$NAME]:-0}"
+    if (( $(echo "$BASELINE > 0" | bc -l) )); then
+        SPEEDUP=$(echo "scale=2; $BASELINE / $BEST_TIME" | bc)
+    else
+        SPEEDUP="N/A"
+    fi
+
+    printf "%-20s %8s %8s %10s %8s\n" "$NAME" "$CELLS" "$STEPS" "$BEST_TIME" "$SPEEDUP"x
+
+done
+
+echo ""
+echo "----------------------------------------------"
+echo " Total time : ${TOTAL_TIME}s"
+echo " Total steps: $TOTAL_STEPS"
+echo "----------------------------------------------"
+echo ""
+echo "Baseline times (CPU, 48-core Zen 4):"
+for name in "${!CPU_BASELINE[@]}"; do
+    echo "  $name: ${CPU_BASELINE[$name]}s"
+done
+echo ""
+echo "=============================================="
diff --git a/current_lightning.exc b/current_lightning.exc
new file mode 120000
index 000000000..cb9f6b34c
--- /dev/null
+++ b/current_lightning.exc
@@ -0,0 +1 @@
+/home/luis/ugrfdtd/publico/tmp_cases/cybonera/current_lightning.exc
\ No newline at end of file
diff --git a/gauss.exc b/gauss.exc
new file mode 120000
index 000000000..57a08776a
--- /dev/null
+++ b/gauss.exc
@@ -0,0 +1 @@
+/home/luis/ugrfdtd/publico/tmp_cases/sphere/gauss.exc
\ No newline at end of file
diff --git a/nodalPredefinedExcitation.1.exc b/nodalPredefinedExcitation.1.exc
new file mode 120000
index 000000000..f23cc85dd
--- /dev/null
+++ b/nodalPredefinedExcitation.1.exc
@@ -0,0 +1 @@
+/home/luis/ugrfdtd/publico/tmp_cases/nodalSource/predefinedExcitation.1.exc
\ No newline at end of file
diff --git a/opencode.json b/opencode.json
new file mode 100644
index 000000000..e642efd5a
--- /dev/null
+++ b/opencode.json
@@ -0,0 +1,20 @@
+{
+  "$schema": "https://opencode.ai/config.json",
+  "agent": {
+    "tester": {
+      "description": "Runs the full semba-fdtd test suite (unit + integration tests), builds if needed, and produces a JSON report + terminal summary",
+      "mode": "subagent",
+      "permission": {
+        "bash": "allow",
+        "read": "allow",
+        "write": "allow",
+        "edit": "deny",
+        "glob": "allow",
+        "grep": "allow",
+        "list": "allow",
+        "task": "deny",
+        "external_directory": "allow"
+      }
+    }
+  }
+}
diff --git a/predefinedExcitation.1.exc b/predefinedExcitation.1.exc
new file mode 120000
index 000000000..b901a3bc2
--- /dev/null
+++ b/predefinedExcitation.1.exc
@@ -0,0 +1 @@
+/home/luis/ugrfdtd/publico/tmp_cases/multipleAssigments/predefinedExcitation.1.exc
\ No newline at end of file
diff --git a/scripts/tester.py b/scripts/tester.py
new file mode 100755
index 000000000..78ed17f48
--- /dev/null
+++ b/scripts/tester.py
@@ -0,0 +1,280 @@
+#!/usr/bin/env python3
+"""semba-fdtd test runner — builds, runs unit + integration tests, produces JSON report + terminal summary."""
+
+import subprocess
+import sys
+import os
+import json
+import re
+import time
+from datetime import datetime
+from pathlib import Path
+
+ROOT = Path(__file__).resolve().parent.parent
+BUILD_DIR = ROOT / "build"
+TMP_DIR = ROOT / "tmp"
+BIN = BUILD_DIR / "bin" / "semba-fdtd"
+UNIT_TESTS = BUILD_DIR / "bin" / "fdtd_tests"
+
+
+def run(cmd, cwd=None, env=None, capture=True):
+    """Run a command, return (exit_code, stdout, stderr)."""
+    full_env = os.environ.copy()
+    if env:
+        full_env.update(env)
+    result = subprocess.run(
+        cmd,
+        cwd=cwd or str(ROOT),
+        env=full_env,
+        capture_output=capture,
+        text=True,
+    )
+    return result.returncode, result.stdout, result.stderr
+
+
+def detect_cmake_flags():
+    """Read CMakeCache.txt to detect which features were enabled."""
+    cache = BUILD_DIR / "CMakeCache.txt"
+    flags = {}
+    if not cache.exists():
+        return {"MPI": "OFF", "MTLN": "OFF", "HDF": "OFF", "SMBJSON": "OFF"}
+    content = cache.read_text()
+    for key, short in [
+        ("SEMBA_FDTD_ENABLE_MPI", "MPI"),
+        ("SEMBA_FDTD_ENABLE_MTLN", "MTLN"),
+        ("SEMBA_FDTD_ENABLE_HDF", "HDF"),
+        ("SEMBA_FDTD_ENABLE_SMBJSON", "SMBJSON"),
+    ]:
+        for line in content.splitlines():
+            if line.startswith(key + ":"):
+                val = line.split("=", 1)[1].strip().strip('"')
+                flags[short] = val
+                break
+        else:
+            flags[short] = "OFF"
+    return flags
+
+
+def parse_git_commit():
+    rc, out, _ = run(["git", "rev-parse", "--short", "HEAD"], cwd=str(ROOT))
+    if rc == 0:
+        return out.strip()
+    return "unknown"
+
+
+def count_pattern(text, pattern):
+    return len(re.findall(pattern, text))
+
+
+def run_build(cmake_flags=None):
+    """Configure and build the project. Returns (success, duration_sec)."""
+    if cmake_flags is None:
+        cmake_flags = {}
+
+    # Detect existing flags from cache, override with user-provided
+    existing = detect_cmake_flags() if BUILD_DIR.exists() else {}
+    combined = {**existing, **cmake_flags}
+
+    cmake_cmd = [
+        "cmake", "-S", ".", "-B", "build",
+        "-DCMAKE_BUILD_TYPE=Release",
+    ]
+    flag_map = {
+        "MPI": "SEMBA_FDTD_ENABLE_MPI",
+        "MTLN": "SEMBA_FDTD_ENABLE_MTLN",
+        "HDF": "SEMBA_FDTD_ENABLE_HDF",
+        "SMBJSON": "SEMBA_FDTD_ENABLE_SMBJSON",
+    }
+    for short, cmake_var in flag_map.items():
+        if short in combined:
+            cmake_cmd.append(f"-D{cmake_var}={combined[short]}")
+
+    print("=" * 60)
+    print("BUILD")
+    print("=" * 60)
+    print(f"  Command: {' '.join(cmake_cmd)}")
+
+    t0 = time.time()
+    rc, out, err = run(cmake_cmd)
+    if rc != 0:
+        print(f"  cmake configure: FAIL (exit {rc})")
+        return False, time.time() - t0
+    print(f"  cmake configure: OK")
+
+    build_cmd = ["cmake", "--build", "build", "-j"]
+    rc, out, err = run(build_cmd)
+    duration = time.time() - t0
+
+    if rc != 0:
+        print(f"  cmake build: FAIL (exit {rc})")
+        return False, duration
+
+    # Count compiled targets from build output
+    lines = out.splitlines()
+    targets = count_pattern(out, r"Building [CF]+ object")
+    print(f"  cmake build: OK ({targets} targets, {duration:.1f}s)")
+    return True, duration, combined
+
+
+def run_unit_tests():
+    """Run GoogleTest suite. Returns (success, duration_sec, output_lines)."""
+    print()
+    print("=" * 60)
+    print("UNIT TESTS (GoogleTest)")
+    print("=" * 60)
+
+    if not UNIT_TESTS.exists():
+        print(f"  SKIP: {UNIT_TESTS} not found (build first)")
+        return False, 0, 0
+
+    t0 = time.time()
+    rc, out, err = run([str(UNIT_TESTS)])
+    duration = time.time() - t0
+    lines = out.splitlines() if out else []
+
+    # Parse GoogleTest summary
+    passed = count_pattern(out, r"^\[  PASSED  \] [0-9]+ test")
+    failed = count_pattern(out, r"^\[  FAILED  \] [0-9]+ test")
+    total = passed + failed
+
+    # Also count individual test results
+    passed_tests = count_pattern(out, r"\[  PASSED  \]")
+    failed_tests = count_pattern(out, r"\[  FAILED  \]")
+
+    print(f"  Binary: {UNIT_TESTS}")
+    print(f"  Duration: {duration:.1f}s")
+    print(f"  Output lines: {len(lines)}")
+    print(f"  Result: {'PASS' if rc == 0 else 'FAIL'} (exit {rc})")
+
+    return rc == 0, duration, len(lines)
+
+
+def run_integration_tests(env_vars=None):
+    """Run pytest integration tests. Returns (success, duration_sec, output_lines)."""
+    print()
+    print("=" * 60)
+    print("INTEGRATION TESTS (pytest)")
+    print("=" * 60)
+
+    pytest_env = os.environ.copy()
+    if env_vars:
+        pytest_env.update(env_vars)
+
+    t0 = time.time()
+    rc, out, err = run(
+        [sys.executable or "python3", "-m", "pytest", "test/", "--durations=20", "-v"],
+        env=pytest_env,
+    )
+    duration = time.time() - t0
+    lines = out.splitlines() if out else []
+
+    # Parse pytest summary
+    passed = count_pattern(out, r" passed")
+    failed = count_pattern(out, r" failed")
+    skipped = count_pattern(out, r" skipped")
+
+    print(f"  Duration: {duration:.1f}s")
+    print(f"  Output lines: {len(lines)}")
+    print(f"  Result: {'PASS' if rc == 0 else 'FAIL'} (exit {rc})")
+
+    return rc == 0, duration, len(lines)
+
+
+def print_summary(build_ok, build_dur, unit_ok, unit_dur, unit_lines,
+                  integ_ok, integ_dur, integ_lines, cmake_flags):
+    """Print a formatted terminal summary."""
+    print()
+    print("=" * 60)
+    print("TEST REPORT SUMMARY")
+    print("=" * 60)
+    print(f"  {'Stage':<25} {'Status':<8} {'Duration':<12} {'Details'}")
+    print(f"  {'─' * 25} {'─' * 8} {'─' * 12} {'─' * 20}")
+
+    build_status = "PASS" if build_ok else "FAIL"
+    print(f"  {'Build':<25} {build_status:<8} {build_dur:.1f}s{'':<6}")
+
+    unit_status = "PASS" if unit_ok else "FAIL"
+    print(f"  {'Unit Tests':<25} {unit_status:<8} {unit_dur:.1f}s  {unit_lines} lines")
+
+    integ_status = "PASS" if integ_ok else "FAIL"
+    print(f"  {'Integration Tests':<25} {integ_status:<8} {integ_dur:.1f}s  {integ_lines} lines")
+
+    print(f"  {'─' * 25} {'─' * 8} {'─' * 12} {'─' * 20}")
+
+    overall = "PASS" if (build_ok and unit_ok and integ_ok) else "FAIL"
+    print(f"  {'OVERALL':<25} {overall:<8}")
+    print("=" * 60)
+
+
+def write_report(build_ok, build_dur, unit_ok, unit_dur, unit_lines,
+                 integ_ok, integ_dur, integ_lines, cmake_flags):
+    """Write JSON report to tmp/test_report_<timestamp>.json."""
+    TMP_DIR.mkdir(exist_ok=True)
+    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
+    report = {
+        "timestamp": datetime.now().isoformat(),
+        "commit": parse_git_commit(),
+        "build_type": "Release",
+        "cmake_flags": cmake_flags,
+        "build": {"success": build_ok, "duration_sec": round(build_dur, 2)},
+        "unit_tests": {"success": unit_ok, "duration_sec": round(unit_dur, 2),
+                       "output_lines": unit_lines},
+        "integration_tests": {"success": integ_ok, "duration_sec": round(integ_dur, 2),
+                              "output_lines": integ_lines},
+        "overall": "pass" if (build_ok and unit_ok and integ_ok) else "fail",
+    }
+    report_path = TMP_DIR / f"test_report_{ts}.json"
+    report_path.write_text(json.dumps(report, indent=2))
+    print(f"\n  JSON report: {report_path}")
+    return report
+
+
+def main():
+    # Accept optional cmake flags as arguments: e.g., --MPI=OFF --MTLN=OFF
+    cmake_flags = {}
+    i = 1
+    while i < len(sys.argv):
+        arg = sys.argv[i]
+        if arg.startswith("--"):
+            key_val = arg[2:]
+            if "=" in key_val:
+                k, v = key_val.split("=", 1)
+                cmake_flags[k] = v
+        i += 1
+
+    # Build
+    build_result = run_build(cmake_flags)
+    if len(build_result) == 3:
+        build_ok, build_dur, cmake_flags = build_result
+    else:
+        build_ok, build_dur = build_result[0], build_result[1]
+        cmake_flags = detect_cmake_flags()
+
+    if not build_ok:
+        print("\nBuild failed. Skipping tests.")
+        unit_ok, unit_dur, unit_lines = False, 0, 0
+        integ_ok, integ_dur, integ_lines = False, 0, 0
+    else:
+        # Unit tests
+        unit_result = run_unit_tests()
+        unit_ok, unit_dur, unit_lines = unit_result
+
+        # Integration tests
+        env_vars = {f"SEMBA_FDTD_ENABLE_{k}": v for k, v in cmake_flags.items()
+                    if k in ("MPI", "MTLN", "HDF")}
+        integ_result = run_integration_tests(env_vars)
+        integ_ok, integ_dur, integ_lines = integ_result
+
+    # Summary + report
+    print_summary(build_ok, build_dur, unit_ok, unit_dur, unit_lines,
+                  integ_ok, integ_dur, integ_lines, cmake_flags)
+    write_report(build_ok, build_dur, unit_ok, unit_dur, unit_lines,
+                 integ_ok, integ_dur, integ_lines, cmake_flags)
+
+    # Exit code
+    all_ok = build_ok and unit_ok and integ_ok
+    sys.exit(0 if all_ok else 1)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/set_precompiled_libraries.cmake b/set_precompiled_libraries.cmake
index d56be03ca..eb5d8967c 100644
--- a/set_precompiled_libraries.cmake
+++ b/set_precompiled_libraries.cmake
@@ -6,8 +6,8 @@ if (SEMBA_FDTD_ENABLE_MTLN)
 			set(LAPACK_DIR "${PRECOMPILED_DIR}/linux-gcc/lapack/")
 			set(LAPACK_LIB ${LAPACK_DIR}liblapack.a)
 			set(BLAS_LIB ${LAPACK_DIR}libblas.a)
-		elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "NVHPC")
-			set(LAPACK_DIR "/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/lib/")
+	elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "NVHPC")
+			set(LAPACK_DIR "/home/luis/nvidia_sdk/Linux_x86_64/26.3/compilers/lib/")
 			set(LAPACK_LIB ${LAPACK_DIR}liblapack.a)
 			set(BLAS_LIB ${LAPACK_DIR}libblas.a)
 		elseif (${CMAKE_Fortran_COMPILER_ID} STREQUAL "IntelLLVM")
diff --git a/src_main_pub/borderscpml.F90 b/src_main_pub/borderscpml.F90
index 361651eb4..fc076e2f0 100755
--- a/src_main_pub/borderscpml.F90
+++ b/src_main_pub/borderscpml.F90
@@ -58,7 +58,9 @@ module BORDERS_CPML_m
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    !
    public  :: InitCPMLBorders, AdvanceelectricCPML,AdvanceMagneticCPML,StoreFieldsCPMLBorders,DestroyCPMLBorders,AdvanceelectricCPML_freespace,AdvanceMagneticCPML_freespace
-   public  :: calc_cpmlconstants
+    public  :: calc_cpmlconstants
+    public  :: PMLc
+    public  :: P_be_x, P_ce_x, P_bm_x, P_cm_x, P_be_y, P_ce_y, P_bm_y, P_cm_y, P_be_z, P_ce_z, P_bm_z, P_cm_z
    !!!public  :: FreeSpace_AdvanceMagneticCPML,calc_cpmlconstants
 contains
 
@@ -550,7 +552,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = left
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -570,7 +572,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -594,7 +596,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION =  right
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -614,7 +616,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -640,7 +642,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = down
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -659,7 +661,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -684,7 +686,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = up
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -703,7 +705,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -730,7 +732,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = back
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -749,7 +751,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -774,7 +776,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = front
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -793,7 +795,7 @@ subroutine AdvanceelectricCPML( NumMedia, b, sggMiEx, sggMiEy, sggMiEz, g2, Ex,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -848,7 +850,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = left
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -868,7 +870,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -893,7 +895,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 
       REGION = right
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -913,7 +915,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -938,7 +940,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = down
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -958,7 +960,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -983,7 +985,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = up
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -1003,7 +1005,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -1028,7 +1030,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION=back
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -1048,7 +1050,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -1072,7 +1074,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION=front
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -1092,7 +1094,7 @@ subroutine AdvanceMagneticCPML( NumMedia, b, sggMiHx, sggMiHy, sggMiHz, gm2, Hx,
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -1705,7 +1707,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = left
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -1725,7 +1727,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -1749,7 +1751,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION =  right
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -1769,7 +1771,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -1795,7 +1797,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = down
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -1814,7 +1816,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -1839,7 +1841,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = up
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -1858,7 +1860,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEx)%ZI( REGION), PMLc(iEx)%ZE( REGION)
          k_m = k - b%Ex%ZI
@@ -1885,7 +1887,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = back
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -1904,7 +1906,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -1929,7 +1931,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = front
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEz)%ZI( REGION), PMLc(iEz)%ZE( REGION)
          k_m = k - b%Ez%ZI
@@ -1948,7 +1950,7 @@ subroutine AdvanceelectricCPML_freespace( NumMedia, b, sggMiEx, sggMiEy, sggMiEz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iEy)%ZI( REGION), PMLc(iEy)%ZE( REGION)
          k_m = k - b%Ey%ZI
@@ -2003,7 +2005,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = left
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -2023,7 +2025,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -2048,7 +2050,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 
       REGION = right
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -2068,7 +2070,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -2093,7 +2095,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = down
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -2113,7 +2115,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -2138,7 +2140,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION = up
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -2158,7 +2160,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHx)%ZI( REGION), PMLc(iHx)%ZE( REGION)
          k_m = k - b%Hx%ZI
@@ -2183,7 +2185,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION=back
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -2203,7 +2205,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
@@ -2227,7 +2229,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       REGION=front
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHz)%ZI( REGION), PMLc(iHz)%ZE( REGION)
          k_m = k - b%Hz%ZI
@@ -2247,7 +2249,7 @@ subroutine AdvanceMagneticCPML_freespace( NumMedia, b, sggMiHx, sggMiHy, sggMiHz
 !$OMP END PARALLEL DO
 #endif
 #ifdef CompileWithOpenMP
-!$OMP PARALLEL do DEFAULT(SHARED) private (i,j,k,i_m,j_m,k_m,medio)
+!$OMP PARALLEL do DEFAULT(SHARED) private(i,j,k,i_m,j_m,k_m,medio)
 #endif
       do k = PMLc(iHy)%ZI( REGION), PMLc(iHy)%ZE( REGION)
          k_m = k - b%Hy%ZI
diff --git a/src_main_pub/bordersmur.F90 b/src_main_pub/bordersmur.F90
index 1ea1cfd7e..1b47070cb 100755
--- a/src_main_pub/bordersmur.F90
+++ b/src_main_pub/bordersmur.F90
@@ -47,12 +47,27 @@ module BORDERS_MUR_m
 !!!
    !
    public  :: InitMURBorders, AdvanceMagneticMUR,StoreFieldsMURBorders,DestroyMURBorders,calc_murconstants
-
+     public  :: get_mur_limits
+     public  :: regLR, regDU, regBF
+    public  :: left_CAB1, left_CAB3, left_cab4, right_CAB1, right_CAB3, right_cab4
+    public  :: down_CAB1, down_CAB3, down_cab4, up_CAB1, up_CAB3, up_cab4
+    public  :: back_CAB1, back_CAB3, back_cab4, front_CAB1, front_CAB3, front_cab4
 
 contains
 
-   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-   !!! Initializes MUR data
+   subroutine get_mur_limits(field, bound, xi, xe, yi, ye, zi, ze)
+      integer(kind=4), intent(in) :: field, bound
+      integer(kind=4), intent(out) :: xi, xe, yi, ye, zi, ze
+      xi = MURc(field)%XI(bound)
+      xe = MURc(field)%XE(bound)
+      yi = MURc(field)%YI(bound)
+      ye = MURc(field)%YE(bound)
+      zi = MURc(field)%ZI(bound)
+      ze = MURc(field)%ZE(bound)
+   end subroutine get_mur_limits
+
+    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+    !!! Initializes MUR data
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    subroutine InitMURBorders(sgg,ThereAreMURBorders,resume,Idxh,Idyh,Idzh,eps00,mu00)
       real(kind=RKIND) :: eps00,mu00
diff --git a/src_main_pub/farfield.F90 b/src_main_pub/farfield.F90
index d3e23f2dc..ff645942d 100755
--- a/src_main_pub/farfield.F90
+++ b/src_main_pub/farfield.F90
@@ -71,7 +71,7 @@ module farfield_m
 !!!
    !
    public UpdateFarField,InitFarField,Destroyfarfield,FlushFarfield,StoreFarfields
-   public farfield_t
+   public farfield_t, FF
    !
    type(farfield_t), save, target :: FF
 
diff --git a/src_main_pub/gpu_core_m.F90 b/src_main_pub/gpu_core_m.F90
new file mode 100644
index 000000000..478189726
--- /dev/null
+++ b/src_main_pub/gpu_core_m.F90
@@ -0,0 +1,2107 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU CORE MODULE - CUDA Fortran (CUF)
+!  gpu_state_t type definition + init/upload/download/destroy
+!  Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_core_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+
+   implicit none
+
+     type gpu_state_t
+       ! Host-side pointers (same as original)
+       real(kind=rkind), pointer, dimension(:,:,:), contiguous :: Ex, Ey, Ez, Hx, Hy, Hz
+       real(kind=rkind), pointer, dimension(:), contiguous :: Idxe, Idye, Idze, Idxh, Idyh, Idzh, dxe, dye, dze, dxh, dyh, dzh
+       integer(kind=integersizeofmediamatrices), pointer, dimension(:,:,:), contiguous :: sggMiEx, sggMiEy, sggMiEz, sggMiHx, sggMiHy, sggMiHz
+       real(kind=rkind), pointer, dimension(:), contiguous :: g1, g2, gm1, gm2
+
+       ! Persistent device memory buffers - YEE fields
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d
+       real(kind=rkind), pointer, device, dimension(:) :: Idxe_d, Idye_d, Idze_d, Idxh_d, Idyh_d, Idzh_d, dxe_d, dye_d, dze_d, dxh_d, dyh_d, dzh_d
+       integer(kind=integersizeofmediamatrices), pointer, device, dimension(:,:,:) :: sggMiEx_d, sggMiEy_d, sggMiEz_d, sggMiHx_d, sggMiHy_d, sggMiHz_d
+       real(kind=rkind), pointer, device, dimension(:) :: g1_d, g2_d, gm1_d, gm2_d
+
+       ! Dimensions for device memory
+       integer(kind=4) :: Ex_nx, Ex_ny, Ex_nz, Ey_nx, Ey_ny, Ey_nz, Ez_nx, Ez_ny, Ez_nz
+       integer(kind=4) :: Hx_nx, Hx_ny, Hx_nz, Hy_nx, Hy_ny, Hy_nz, Hz_nx, Hz_ny, Hz_nz
+       integer(kind=4) :: Idxe_n, Idye_n, Idze_n, Idxh_n, Idyh_n, Idzh_n
+       integer(kind=4) :: dxe_n, dye_n, dze_n, dxh_n, dyh_n, dzh_n
+       integer(kind=4) :: g1_n, g2_n, gm1_n, gm2_n
+       integer(kind=4) :: sggMi_nx, sggMi_ny, sggMi_nz
+
+       ! CPML left boundary - persistent device arrays
+       ! 4 psi arrays for left boundary (same global bounds as host arrays)
+       integer(kind=4) :: pml_left_Exy_nx, pml_left_Exy_ny, pml_left_Exy_nz
+       integer(kind=4) :: pml_left_Ezy_nx, pml_left_Ezy_ny, pml_left_Ezy_nz
+       integer(kind=4) :: pml_left_Hxy_nx, pml_left_Hxy_ny, pml_left_Hxy_nz
+       integer(kind=4) :: pml_left_Hzy_nx, pml_left_Hzy_ny, pml_left_Hzy_nz
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Exy_left
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Ezy_left
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hxy_left
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hzy_left
+
+       ! CPML left boundary coefficients (device, updated every step)
+       integer(kind=4) :: pml_coeff_y_n
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_y_left
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_y_left
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_y_left
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_y_left
+
+       ! Left boundary PML limits (device)
+       integer(kind=4) :: pml_left_Ex_ii, pml_left_Ex_ij, pml_left_Ex_ji, pml_left_Ex_jj, pml_left_Ex_ki, pml_left_Ex_kj
+       integer(kind=4) :: pml_left_Ez_ii, pml_left_Ez_ij, pml_left_Ez_ji, pml_left_Ez_jj, pml_left_Ez_ki, pml_left_Ez_kj
+       integer(kind=4) :: pml_left_Hx_ii, pml_left_Hx_ij, pml_left_Hx_ji, pml_left_Hx_jj, pml_left_Hx_ki, pml_left_Hx_kj
+       integer(kind=4) :: pml_left_Hz_ii, pml_left_Hz_ij, pml_left_Hz_ji, pml_left_Hz_jj, pml_left_Hz_ki, pml_left_Hz_kj
+
+       ! CPML right boundary - same structure as left
+       integer(kind=4) :: pml_right_Ex_ii, pml_right_Ex_ij, pml_right_Ex_ji, pml_right_Ex_jj, pml_right_Ex_ki, pml_right_Ex_kj
+       integer(kind=4) :: pml_right_Ez_ii, pml_right_Ez_ij, pml_right_Ez_ji, pml_right_Ez_jj, pml_right_Ez_ki, pml_right_Ez_kj
+       integer(kind=4) :: pml_right_Hx_ii, pml_right_Hx_ij, pml_right_Hx_ji, pml_right_Hx_jj, pml_right_Hx_ki, pml_right_Hx_kj
+       integer(kind=4) :: pml_right_Hz_ii, pml_right_Hz_ij, pml_right_Hz_ji, pml_right_Hz_jj, pml_right_Hz_ki, pml_right_Hz_kj
+
+       ! CPML down/up boundary - z-dependent coefficients
+       integer(kind=4) :: pml_down_Ex_ii, pml_down_Ex_ij, pml_down_Ex_ji, pml_down_Ex_jj, pml_down_Ex_ki, pml_down_Ex_kj
+       integer(kind=4) :: pml_down_Ey_ii, pml_down_Ey_ij, pml_down_Ey_ji, pml_down_Ey_jj, pml_down_Ey_ki, pml_down_Ey_kj
+       integer(kind=4) :: pml_down_Hx_ii, pml_down_Hx_ij, pml_down_Hx_ji, pml_down_Hx_jj, pml_down_Hx_ki, pml_down_Hx_kj
+       integer(kind=4) :: pml_down_Hy_ii, pml_down_Hy_ij, pml_down_Hy_ji, pml_down_Hy_jj, pml_down_Hy_ki, pml_down_Hy_kj
+       integer(kind=4) :: pml_down_Exz_nx, pml_down_Exz_ny, pml_down_Exz_nz
+       integer(kind=4) :: pml_down_Eyz_nx, pml_down_Eyz_ny, pml_down_Eyz_nz
+       integer(kind=4) :: pml_down_Hxz_nx, pml_down_Hxz_ny, pml_down_Hxz_nz
+       integer(kind=4) :: pml_down_Hyz_nx, pml_down_Hyz_ny, pml_down_Hyz_nz
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Exz_down
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Eyz_down
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hxz_down
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hyz_down
+       integer(kind=4) :: pml_coeff_z_n
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_z_down
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_z_down
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_z_down
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_z_down
+       integer(kind=4) :: pml_up_Ex_ii, pml_up_Ex_ij, pml_up_Ex_ji, pml_up_Ex_jj, pml_up_Ex_ki, pml_up_Ex_kj
+       integer(kind=4) :: pml_up_Ey_ii, pml_up_Ey_ij, pml_up_Ey_ji, pml_up_Ey_jj, pml_up_Ey_ki, pml_up_Ey_kj
+       integer(kind=4) :: pml_up_Hx_ii, pml_up_Hx_ij, pml_up_Hx_ji, pml_up_Hx_jj, pml_up_Hx_ki, pml_up_Hx_kj
+       integer(kind=4) :: pml_up_Hy_ii, pml_up_Hy_ij, pml_up_Hy_ji, pml_up_Hy_jj, pml_up_Hy_ki, pml_up_Hy_kj
+
+       ! CPML back/front boundary - x-dependent coefficients
+       integer(kind=4) :: pml_back_Ez_ii, pml_back_Ez_ij, pml_back_Ez_ji, pml_back_Ez_jj, pml_back_Ez_ki, pml_back_Ez_kj
+       integer(kind=4) :: pml_back_Ey_ii, pml_back_Ey_ij, pml_back_Ey_ji, pml_back_Ey_jj, pml_back_Ey_ki, pml_back_Ey_kj
+       integer(kind=4) :: pml_back_Hz_ii, pml_back_Hz_ij, pml_back_Hz_ji, pml_back_Hz_jj, pml_back_Hz_ki, pml_back_Hz_kj
+       integer(kind=4) :: pml_back_Hy_ii, pml_back_Hy_ij, pml_back_Hy_ji, pml_back_Hy_jj, pml_back_Hy_ki, pml_back_Hy_kj
+       integer(kind=4) :: pml_back_Ezx_nx, pml_back_Ezx_ny, pml_back_Ezx_nz
+       integer(kind=4) :: pml_back_Eyx_nx, pml_back_Eyx_ny, pml_back_Eyx_nz
+       integer(kind=4) :: pml_back_Hzx_nx, pml_back_Hzx_ny, pml_back_Hzx_nz
+       integer(kind=4) :: pml_back_Hyx_nx, pml_back_Hyx_ny, pml_back_Hyx_nz
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Ezx_back
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Eyx_back
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hzx_back
+       real(kind=rkind), pointer, device, dimension(:,:,:) :: pml_psi_Hyx_back
+       integer(kind=4) :: pml_coeff_x_n
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_be_x_back
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_ce_x_back
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_bm_x_back
+       real(kind=rkind), pointer, device, dimension(:) :: pml_P_cm_x_back
+       integer(kind=4) :: pml_front_Ez_ii, pml_front_Ez_ij, pml_front_Ez_ji, pml_front_Ez_jj, pml_front_Ez_ki, pml_front_Ez_kj
+       integer(kind=4) :: pml_front_Ey_ii, pml_front_Ey_ij, pml_front_Ey_ji, pml_front_Ey_jj, pml_front_Ey_ki, pml_front_Ey_kj
+       integer(kind=4) :: pml_front_Hz_ii, pml_front_Hz_ij, pml_front_Hz_ji, pml_front_Hz_jj, pml_front_Hz_ki, pml_front_Hz_kj
+       integer(kind=4) :: pml_front_Hy_ii, pml_front_Hy_ij, pml_front_Hy_ji, pml_front_Hy_jj, pml_front_Hy_ki, pml_front_Hy_kj
+
+       ! MUR boundary - persistent device coefficient arrays (per media)
+         integer(kind=4) :: mur_numMedia
+         real(kind=rkind), pointer, device, dimension(:) :: mur_left_CAB1, mur_left_CAB3, mur_left_cab4
+         real(kind=rkind), pointer, device, dimension(:) :: mur_right_CAB1, mur_right_CAB3, mur_right_cab4
+         real(kind=rkind), pointer, device, dimension(:) :: mur_down_CAB1, mur_down_CAB3, mur_down_cab4
+         real(kind=rkind), pointer, device, dimension(:) :: mur_up_CAB1, mur_up_CAB3, mur_up_cab4
+         real(kind=rkind), pointer, device, dimension(:) :: mur_back_CAB1, mur_back_CAB3, mur_back_cab4
+         real(kind=rkind), pointer, device, dimension(:) :: mur_front_CAB1, mur_front_CAB3, mur_front_cab4
+
+         ! MUR CAB1 coefficients — 1D flattened: mur_cab1_d(12 * numMedia)
+         ! Layout: cab1_d((bound_id-1)*numMedia + medio) for contiguous access
+         real(kind=rkind), pointer, device, dimension(:) :: mur_cab1_d
+
+         ! MUR boundary - persistent device past-field arrays
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hx_left, mur_past_Hz_left
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hx_right, mur_past_Hz_right
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hy_down, mur_past_Hx_down
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hy_up, mur_past_Hx_up
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hz_back, mur_past_Hy_back
+         real(kind=rkind), pointer, device, dimension(:,:,:) :: mur_past_Hz_front, mur_past_Hy_front
+
+      ! MUR boundary limits — flattened array: mur_bounds(12, 6)
+          ! Boundary IDs: 1=left-Hx, 2=left-Hz, 3=right-Hx, 4=right-Hz,
+          !               5=down-Hy, 6=down-Hx, 7=up-Hy, 8=up-Hx,
+          !               9=back-Hz, 10=back-Hy, 11=front-Hz, 12=front-Hy
+          ! Column indices: 1=ii, 2=ij, 3=ji, 4=jj, 5=ki, 6=kj
+          integer(kind=4), pointer, device, dimension(:,:) :: mur_bounds_d
+
+         ! Individual limit members kept for backward compat with existing kernels
+         integer(kind=4) :: mur_left_Hx_ii, mur_left_Hx_ij, mur_left_Hx_ji, mur_left_Hx_jj, mur_left_Hx_ki, mur_left_Hx_kj
+         integer(kind=4) :: mur_left_Hz_ii, mur_left_Hz_ij, mur_left_Hz_ji, mur_left_Hz_jj, mur_left_Hz_ki, mur_left_Hz_kj
+         integer(kind=4) :: mur_right_Hx_ii, mur_right_Hx_ij, mur_right_Hx_ji, mur_right_Hx_jj, mur_right_Hx_ki, mur_right_Hx_kj
+         integer(kind=4) :: mur_right_Hz_ii, mur_right_Hz_ij, mur_right_Hz_ji, mur_right_Hz_jj, mur_right_Hz_ki, mur_right_Hz_kj
+         integer(kind=4) :: mur_down_Hy_ii, mur_down_Hy_ij, mur_down_Hy_ji, mur_down_Hy_jj, mur_down_Hy_ki, mur_down_Hy_kj
+         integer(kind=4) :: mur_down_Hx_ii, mur_down_Hx_ij, mur_down_Hx_ji, mur_down_Hx_jj, mur_down_Hx_ki, mur_down_Hx_kj
+         integer(kind=4) :: mur_up_Hy_ii, mur_up_Hy_ij, mur_up_Hy_ji, mur_up_Hy_jj, mur_up_Hy_ki, mur_up_Hy_kj
+         integer(kind=4) :: mur_up_Hx_ii, mur_up_Hx_ij, mur_up_Hx_ji, mur_up_Hx_jj, mur_up_Hx_ki, mur_up_Hx_kj
+         integer(kind=4) :: mur_back_Hz_ii, mur_back_Hz_ij, mur_back_Hz_ji, mur_back_Hz_jj, mur_back_Hz_ki, mur_back_Hz_kj
+         integer(kind=4) :: mur_back_Hy_ii, mur_back_Hy_ij, mur_back_Hy_ji, mur_back_Hy_jj, mur_back_Hy_ki, mur_back_Hy_kj
+         integer(kind=4) :: mur_front_Hz_ii, mur_front_Hz_ij, mur_front_Hz_ji, mur_front_Hz_jj, mur_front_Hz_ki, mur_front_Hz_kj
+         integer(kind=4) :: mur_front_Hy_ii, mur_front_Hy_ij, mur_front_Hy_ji, mur_front_Hy_jj, mur_front_Hy_ki, mur_front_Hy_kj
+
+        ! MUR flags
+        logical :: mur_initialized
+
+        ! Fused kernel execution flags to avoid redundant launches
+        logical :: gpu_e_fused_launched = .false.
+        logical :: gpu_h_fused_launched = .false.
+
+        ! Flags
+       logical :: initialized = .false.
+       logical :: fields_on_device = .false.
+       logical :: pml_left_initialized = .false.
+       logical :: pml_right_initialized = .false.
+       logical :: pml_down_initialized = .false.
+       logical :: pml_up_initialized = .false.
+       logical :: pml_back_initialized = .false.
+       logical :: pml_front_initialized = .false.
+
+    ! Download tracking for lazy field transfer
+        integer(kind=4) :: last_download_step = 0
+
+        ! Probe result buffers on device (for on-device probe sampling)
+        real(kind=rkind), pointer, device, dimension(:) :: probe_results_d
+        integer(kind=4), pointer, device, dimension(:) :: probe_field_ids_d
+        integer(kind=4), pointer, device, dimension(:) :: probe_I_d, probe_J_d, probe_K_d
+        integer(kind=4) :: num_probe_results = 0
+        ! Block probe buffers (bulkCurrent probes)
+        real(kind=rkind), pointer, device, dimension(:) :: block_probe_results_d
+        integer(kind=4), pointer, device, dimension(:) :: block_probe_field_ids_d
+        integer(kind=4), pointer, device, dimension(:) :: block_probe_I1_d, block_probe_J1_d, block_probe_K1_d
+        integer(kind=4), pointer, device, dimension(:) :: block_probe_I2_d, block_probe_J2_d, block_probe_K2_d
+        integer(kind=4) :: num_block_probe_results = 0
+        logical :: probe_buffers_initialized = .false.
+
+        ! NF2FF (near-to-far-field) device buffers
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_ExIz_d, nf2ff_ExDe_d, nf2ff_ExAb_d, nf2ff_ExAr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_EyFr_d, nf2ff_EyTr_d, nf2ff_EyAb_d, nf2ff_EyAr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_EzIz_d, nf2ff_EzDe_d, nf2ff_EzFr_d, nf2ff_EzTr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HxIz_d, nf2ff_HxDe_d, nf2ff_HxAb_d, nf2ff_HxAr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HyFr_d, nf2ff_HyTr_d, nf2ff_HyAb_d, nf2ff_HyAr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HzIz_d, nf2ff_HzDe_d, nf2ff_HzFr_d, nf2ff_HzTr_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HxIz2_d, nf2ff_HxDe2_d, nf2ff_HxAb2_d, nf2ff_HxAr2_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HyFr2_d, nf2ff_HyTr2_d, nf2ff_HyAb2_d, nf2ff_HyAr2_d
+        complex(kind=rkind), pointer, device, dimension(:,:,:) :: nf2ff_HzIz2_d, nf2ff_HzDe2_d, nf2ff_HzFr2_d, nf2ff_HzTr2_d
+
+        ! NF2FF frequency arrays (device)
+        complex(kind=rkind), pointer, device, dimension(:) :: nf2ff_expIwdt_d, nf2ff_auxExp_E_d, nf2ff_auxExp_H_d
+
+        ! NF2FF geometry (device) — 18 coordinate arrays for Mx,My,Mz,Jx,Jy,Jz per cell
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Mx_d, nf2ff_phys_y_Mx_d, nf2ff_phys_z_Mx_d
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_My_d, nf2ff_phys_y_My_d, nf2ff_phys_z_My_d
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Mz_d, nf2ff_phys_y_Mz_d, nf2ff_phys_z_Mz_d
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jx_d, nf2ff_phys_y_Jx_d, nf2ff_phys_z_Jx_d
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jy_d, nf2ff_phys_y_Jy_d, nf2ff_phys_z_Jy_d
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_phys_x_Jz_d, nf2ff_phys_y_Jz_d, nf2ff_phys_z_Jz_d
+
+        ! NF2FF cell dimensions (device)
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_dyh_d, nf2ff_dze_d, nf2ff_dye_d, nf2ff_dzh_d
+
+        ! NF2FF output buffers (device)
+        real(kind=rkind), pointer, device, dimension(:) :: nf2ff_Etheta_d, nf2ff_Ephi_d, nf2ff_RCS_d
+
+        ! NF2FF configuration
+        logical :: nf2ff_initialized = .false.
+        integer(kind=4) :: nf2ff_num_cells = 0
+        integer(kind=4) :: nf2ff_num_freqs = 0
+        integer(kind=4) :: nf2ff_Ntheta = 0
+        integer(kind=4) :: nf2ff_Nphi = 0
+        integer(kind=4) :: nf2ff_theta_start = 0
+        integer(kind=4) :: nf2ff_theta_stop = 0
+        integer(kind=4) :: nf2ff_phi_start = 0
+        integer(kind=4) :: nf2ff_phi_stop = 0
+        real(kind=rkind) :: nf2ff_thetaStep = 0.0_rkind
+        real(kind=rkind) :: nf2ff_phiStep = 0.0_rkind
+        real(kind=rkind) :: nf2ff_freqStep = 0.0_rkind
+        real(kind=rkind) :: nf2ff_initialFreq = 0.0_rkind
+        real(kind=rkind) :: nf2ff_cluz = 0.0_rkind
+        real(kind=rkind) :: nf2ff_z0 = 0.0_rkind
+        real(kind=rkind) :: nf2ff_XDobleAncho = 0.0_rkind
+        real(kind=rkind) :: nf2ff_YDobleAncho = 0.0_rkind
+        real(kind=rkind) :: nf2ff_ZDobleAncho = 0.0_rkind
+        integer(kind=4) :: nf2ff_sym_flags = 0
+      end type
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Initialize GPU data regions - called once at simulation start
+   !--------------------------------------------------------------------------------
+   subroutine gpu_init(this, Ex, Ey, Ez, Hx, Hy, Hz, sggMiEx, sggMiEy, sggMiEz, &
+                       sggMiHx, sggMiHy, sggMiHz, g1, g2, gm1, gm2, &
+                       Idxe_in, Idye_in, Idze_in, Idxh_in, Idyh_in, Idzh_in, dxe_in, dye_in, dze_in, dxh_in, dyh_in, dzh_in)
+      class(gpu_state_t), intent(inout) :: this
+      real(kind=rkind), dimension(:,:,:), pointer, contiguous, intent(in) :: Ex, Ey, Ez, Hx, Hy, Hz
+      integer(kind=integersizeofmediamatrices), dimension(:,:,:), pointer, contiguous, intent(in) :: sggMiEx, sggMiEy, sggMiEz, sggMiHx, sggMiHy, sggMiHz
+      real(kind=rkind), dimension(:), pointer, intent(in) :: g1, g2, gm1, gm2
+      real(kind=rkind), dimension(:), pointer, intent(in) :: Idxe_in, Idye_in, Idze_in, Idxh_in, Idyh_in, Idzh_in, dxe_in, dye_in, dze_in, dxh_in, dyh_in, dzh_in
+
+      integer(kind=4) :: ndev
+      integer(kind=4) :: cuda_status
+      integer(kind=4) :: env_status
+      character(len=16) :: enable_cuf
+
+      call get_environment_variable("SEMBA_FDTD_ENABLE_CUF_RUNTIME", enable_cuf, status=env_status)
+      if (env_status /= 0 .or. trim(enable_cuf) /= "1") then
+         this%initialized = .false.
+         return
+      endif
+
+      cuda_status = cudaGetDeviceCount(ndev)
+      if (cuda_status /= cudaSuccess .or. ndev <= 0) then
+         this%initialized = .false.
+         return
+      endif
+
+      ! Store host pointers
+      this%Ex => Ex; this%Ey => Ey; this%Ez => Ez
+      this%Hx => Hx; this%Hy => Hy; this%Hz => Hz
+      this%sggMiEx => sggMiEx; this%sggMiEy => sggMiEy; this%sggMiEz => sggMiEz
+      this%sggMiHx => sggMiHx; this%sggMiHy => sggMiHy; this%sggMiHz => sggMiHz
+      this%g1 => g1; this%g2 => g2; this%gm1 => gm1; this%gm2 => gm2
+      this%Idxe => Idxe_in; this%Idye => Idye_in; this%Idze => Idze_in
+      this%Idxh => Idxh_in; this%Idyh => Idyh_in; this%Idzh => Idzh_in
+      this%dxe => dxe_in; this%dye => dye_in; this%dze => dze_in
+      this%dxh => dxh_in; this%dyh => dyh_in; this%dzh => dzh_in
+
+      ! Store dimensions
+      this%Ex_nx = ubound(Ex,1) - lbound(Ex,1) + 1
+      this%Ex_ny = ubound(Ex,2) - lbound(Ex,2) + 1
+      this%Ex_nz = ubound(Ex,3) - lbound(Ex,3) + 1
+      this%Ey_nx = ubound(Ey,1) - lbound(Ey,1) + 1
+      this%Ey_ny = ubound(Ey,2) - lbound(Ey,2) + 1
+      this%Ey_nz = ubound(Ey,3) - lbound(Ey,3) + 1
+      this%Ez_nx = ubound(Ez,1) - lbound(Ez,1) + 1
+      this%Ez_ny = ubound(Ez,2) - lbound(Ez,2) + 1
+      this%Ez_nz = ubound(Ez,3) - lbound(Ez,3) + 1
+      this%Hx_nx = ubound(Hx,1) - lbound(Hx,1) + 1
+      this%Hx_ny = ubound(Hx,2) - lbound(Hx,2) + 1
+      this%Hx_nz = ubound(Hx,3) - lbound(Hx,3) + 1
+      this%Hy_nx = ubound(Hy,1) - lbound(Hy,1) + 1
+      this%Hy_ny = ubound(Hy,2) - lbound(Hy,2) + 1
+      this%Hy_nz = ubound(Hy,3) - lbound(Hy,3) + 1
+      this%Hz_nx = ubound(Hz,1) - lbound(Hz,1) + 1
+      this%Hz_ny = ubound(Hz,2) - lbound(Hz,2) + 1
+      this%Hz_nz = ubound(Hz,3) - lbound(Hz,3) + 1
+
+      this%Idxe_n = ubound(Idxe_in,1) - lbound(Idxe_in,1) + 1
+      this%Idye_n = ubound(Idye_in,1) - lbound(Idye_in,1) + 1
+      this%Idze_n = ubound(Idze_in,1) - lbound(Idze_in,1) + 1
+      this%Idxh_n = ubound(Idxh_in,1) - lbound(Idxh_in,1) + 1
+      this%Idyh_n = ubound(Idyh_in,1) - lbound(Idyh_in,1) + 1
+      this%Idzh_n = ubound(Idzh_in,1) - lbound(Idzh_in,1) + 1
+      this%dxe_n = ubound(dxe_in,1) - lbound(dxe_in,1) + 1
+      this%dye_n = ubound(dye_in,1) - lbound(dye_in,1) + 1
+      this%dze_n = ubound(dze_in,1) - lbound(dze_in,1) + 1
+      this%dxh_n = ubound(dxh_in,1) - lbound(dxh_in,1) + 1
+      this%dyh_n = ubound(dyh_in,1) - lbound(dyh_in,1) + 1
+      this%dzh_n = ubound(dzh_in,1) - lbound(dzh_in,1) + 1
+      this%g1_n = ubound(g1,1) - lbound(g1,1) + 1
+      this%g2_n = ubound(g2,1) - lbound(g2,1) + 1
+      this%gm1_n = ubound(gm1,1) - lbound(gm1,1) + 1
+      this%gm2_n = ubound(gm2,1) - lbound(gm2,1) + 1
+
+      this%sggMi_nx = ubound(sggMiEx,1) - lbound(sggMiEx,1) + 1
+      this%sggMi_ny = ubound(sggMiEx,2) - lbound(sggMiEx,2) + 1
+      this%sggMi_nz = ubound(sggMiEx,3) - lbound(sggMiEx,3) + 1
+
+      ! Allocate persistent device memory - YEE fields (0-based to match host arrays)
+       allocate(this%Ex_d(0:this%Ex_nx-1, 0:this%Ex_ny-1, 0:this%Ex_nz-1))
+       allocate(this%Ey_d(0:this%Ey_nx-1, 0:this%Ey_ny-1, 0:this%Ey_nz-1))
+       allocate(this%Ez_d(0:this%Ez_nx-1, 0:this%Ez_ny-1, 0:this%Ez_nz-1))
+       allocate(this%Hx_d(0:this%Hx_nx-1, 0:this%Hx_ny-1, 0:this%Hx_nz-1))
+       allocate(this%Hy_d(0:this%Hy_nx-1, 0:this%Hy_ny-1, 0:this%Hy_nz-1))
+       allocate(this%Hz_d(0:this%Hz_nx-1, 0:this%Hz_ny-1, 0:this%Hz_nz-1))
+
+       allocate(this%Idxe_d(0:this%Idxe_n-1))
+       allocate(this%Idye_d(0:this%Idye_n-1))
+       allocate(this%Idze_d(0:this%Idze_n-1))
+       allocate(this%Idxh_d(0:this%Idxh_n-1))
+       allocate(this%Idyh_d(0:this%Idyh_n-1))
+       allocate(this%Idzh_d(0:this%Idzh_n-1))
+       allocate(this%dxe_d(0:this%dxe_n-1))
+       allocate(this%dye_d(0:this%dye_n-1))
+       allocate(this%dze_d(0:this%dze_n-1))
+       allocate(this%dxh_d(0:this%dxh_n-1))
+       allocate(this%dyh_d(0:this%dyh_n-1))
+       allocate(this%dzh_d(0:this%dzh_n-1))
+
+      allocate(this%sggMiEx_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+       allocate(this%sggMiEy_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+       allocate(this%sggMiEz_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+       allocate(this%sggMiHx_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+       allocate(this%sggMiHy_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+       allocate(this%sggMiHz_d(0:this%sggMi_nx-1, 0:this%sggMi_ny-1, 0:this%sggMi_nz-1))
+
+       allocate(this%g1_d(0:this%g1_n-1))
+       allocate(this%g2_d(0:this%g2_n-1))
+       allocate(this%gm1_d(0:this%gm1_n-1))
+       allocate(this%gm2_d(0:this%gm2_n-1))
+
+      ! Initial upload: host -> device (once at startup)
+      this%Ex_d = this%Ex
+      this%Ey_d = this%Ey
+      this%Ez_d = this%Ez
+      this%Hx_d = this%Hx
+      this%Hy_d = this%Hy
+      this%Hz_d = this%Hz
+      this%fields_on_device = .true.
+
+      this%initialized = .true.
+
+   end subroutine gpu_init
+
+   !--------------------------------------------------------------------------------
+   ! Initialize CPML left boundary on GPU - called after InitCPMLBorders
+   !--------------------------------------------------------------------------------
+   subroutine gpu_init_pml_left(this, P_be_y, P_ce_y, P_bm_y, P_cm_y, &
+                                Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, &
+                                Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj, &
+                                Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, &
+                                Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj, &
+                                Ex_ny, Ex_nz, Ez_ny, Ez_nz, Hx_ny, Hx_nz, Hz_ny, Hz_nz)
+      class(gpu_state_t), intent(inout) :: this
+      real(kind=rkind), dimension(:), intent(in) :: P_be_y, P_ce_y, P_bm_y, P_cm_y
+      integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj
+      integer(kind=4), intent(in) :: Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj
+      integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj
+      integer(kind=4), intent(in) :: Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj
+      integer(kind=4), intent(in) :: Ex_ny, Ex_nz, Ez_ny, Ez_nz, Hx_ny, Hx_nz, Hz_ny, Hz_nz
+
+      integer(kind=4) :: y_lo, y_hi
+
+      if (.not. this%initialized) return
+
+      ! Store PML limits (convert from global 0-based to GPU 1-based)
+      this%pml_left_Ex_ii = Ex_iEx_ii + 1; this%pml_left_Ex_ij = Ex_iEx_ij + 1
+      this%pml_left_Ex_ji = Ex_iEx_ji + 1; this%pml_left_Ex_jj = Ex_iEx_jj + 1
+      this%pml_left_Ex_ki = Ex_iEx_ki + 1; this%pml_left_Ex_kj = Ex_iEx_kj + 1
+
+      this%pml_left_Ez_ii = Ex_iEz_ii + 1; this%pml_left_Ez_ij = Ex_iEz_ij + 1
+      this%pml_left_Ez_ji = Ex_iEz_ji + 1; this%pml_left_Ez_jj = Ex_iEz_jj + 1
+      this%pml_left_Ez_ki = Ex_iEz_ki + 1; this%pml_left_Ez_kj = Ex_iEz_kj + 1
+
+      this%pml_left_Hx_ii = Hx_iHx_ii + 1; this%pml_left_Hx_ij = Hx_iHx_ij + 1
+      this%pml_left_Hx_ji = Hx_iHx_ji + 1; this%pml_left_Hx_jj = Hx_iHx_jj + 1
+      this%pml_left_Hx_ki = Hx_iHx_ki + 1; this%pml_left_Hx_kj = Hx_iHx_kj + 1
+
+      this%pml_left_Hz_ii = Hx_iHz_ii + 1; this%pml_left_Hz_ij = Hx_iHz_ij + 1
+      this%pml_left_Hz_ji = Hx_iHz_ji + 1; this%pml_left_Hz_jj = Hx_iHz_jj + 1
+      this%pml_left_Hz_ki = Hx_iHz_ki + 1; this%pml_left_Hz_kj = Hx_iHz_kj + 1
+
+      ! Allocate psi arrays with global bounds (same as host)
+      this%pml_left_Exy_nx = Ex_iEx_ij - Ex_iEx_ii + 1
+      this%pml_left_Exy_ny = Ex_iEx_jj - Ex_iEx_ji + 1
+      this%pml_left_Exy_nz = Ex_iEx_kj - Ex_iEx_ki + 1
+      allocate(this%pml_psi_Exy_left(this%pml_left_Exy_nx, this%pml_left_Exy_ny, this%pml_left_Exy_nz))
+
+      this%pml_left_Ezy_nx = Ex_iEz_ij - Ex_iEz_ii + 1
+      this%pml_left_Ezy_ny = Ex_iEz_jj - Ex_iEz_ji + 1
+      this%pml_left_Ezy_nz = Ex_iEz_kj - Ex_iEz_ki + 1
+      allocate(this%pml_psi_Ezy_left(this%pml_left_Ezy_nx, this%pml_left_Ezy_ny, this%pml_left_Ezy_nz))
+
+      this%pml_left_Hxy_nx = Hx_iHx_ij - Hx_iHx_ii + 1
+      this%pml_left_Hxy_ny = Hx_iHx_jj - Hx_iHx_ji + 1
+      this%pml_left_Hxy_nz = Hx_iHx_kj - Hx_iHx_ki + 1
+      allocate(this%pml_psi_Hxy_left(this%pml_left_Hxy_nx, this%pml_left_Hxy_ny, this%pml_left_Hxy_nz))
+
+      this%pml_left_Hzy_nx = Hx_iHz_ij - Hx_iHz_ii + 1
+      this%pml_left_Hzy_ny = Hx_iHz_jj - Hx_iHz_ji + 1
+      this%pml_left_Hzy_nz = Hx_iHz_kj - Hx_iHz_ki + 1
+      allocate(this%pml_psi_Hzy_left(this%pml_left_Hzy_nx, this%pml_left_Hzy_ny, this%pml_left_Hzy_nz))
+
+      ! Initialize psi arrays to zero on device
+      this%pml_psi_Exy_left = 0.0_rkind
+      this%pml_psi_Ezy_left = 0.0_rkind
+      this%pml_psi_Hxy_left = 0.0_rkind
+      this%pml_psi_Hzy_left = 0.0_rkind
+
+      ! Allocate and copy coefficient arrays
+      y_lo = lbound(P_be_y, 1)
+      y_hi = ubound(P_be_y, 1)
+      this%pml_coeff_y_n = y_hi - y_lo + 1
+      allocate(this%pml_P_be_y_left(y_lo:y_hi))
+      allocate(this%pml_P_ce_y_left(y_lo:y_hi))
+      allocate(this%pml_P_bm_y_left(y_lo:y_hi))
+      allocate(this%pml_P_cm_y_left(y_lo:y_hi))
+
+      this%pml_P_be_y_left = P_be_y
+      this%pml_P_ce_y_left = P_ce_y
+      this%pml_P_bm_y_left = P_bm_y
+      this%pml_P_cm_y_left = P_cm_y
+
+      this%pml_left_initialized = .true.
+
+   end subroutine gpu_init_pml_left
+
+   !--------------------------------------------------------------------------------
+   ! Initialize CPML right boundary on GPU
+   !--------------------------------------------------------------------------------
+   subroutine gpu_init_pml_right(this, &
+                                 Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, &
+                                 Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj, &
+                                 Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, &
+                                 Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj)
+      class(gpu_state_t), intent(inout) :: this
+      integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj
+      integer(kind=4), intent(in) :: Ex_iEz_ii, Ex_iEz_ij, Ex_iEz_ji, Ex_iEz_jj, Ex_iEz_ki, Ex_iEz_kj
+      integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj
+      integer(kind=4), intent(in) :: Hx_iHz_ii, Hx_iHz_ij, Hx_iHz_ji, Hx_iHz_jj, Hx_iHz_ki, Hx_iHz_kj
+
+      if (.not. this%initialized) return
+
+      this%pml_right_Ex_ii = Ex_iEx_ii + 1; this%pml_right_Ex_ij = Ex_iEx_ij + 1
+      this%pml_right_Ex_ji = Ex_iEx_ji + 1; this%pml_right_Ex_jj = Ex_iEx_jj + 1
+      this%pml_right_Ex_ki = Ex_iEx_ki + 1; this%pml_right_Ex_kj = Ex_iEx_kj + 1
+
+      this%pml_right_Ez_ii = Ex_iEz_ii + 1; this%pml_right_Ez_ij = Ex_iEz_ij + 1
+      this%pml_right_Ez_ji = Ex_iEz_ji + 1; this%pml_right_Ez_jj = Ex_iEz_jj + 1
+      this%pml_right_Ez_ki = Ex_iEz_ki + 1; this%pml_right_Ez_kj = Ex_iEz_kj + 1
+
+      this%pml_right_Hx_ii = Hx_iHx_ii + 1; this%pml_right_Hx_ij = Hx_iHx_ij + 1
+      this%pml_right_Hx_ji = Hx_iHx_ji + 1; this%pml_right_Hx_jj = Hx_iHx_jj + 1
+      this%pml_right_Hx_ki = Hx_iHx_ki + 1; this%pml_right_Hx_kj = Hx_iHx_kj + 1
+
+      this%pml_right_Hz_ii = Hx_iHz_ii + 1; this%pml_right_Hz_ij = Hx_iHz_ij + 1
+      this%pml_right_Hz_ji = Hx_iHz_ji + 1; this%pml_right_Hz_jj = Hx_iHz_jj + 1
+      this%pml_right_Hz_ki = Hx_iHz_ki + 1; this%pml_right_Hz_kj = Hx_iHz_kj + 1
+
+      this%pml_right_initialized = .true.
+
+   end subroutine gpu_init_pml_right
+
+   !--------------------------------------------------------------------------------
+   ! Download device data to host - called only when output/probes are needed
+   !--------------------------------------------------------------------------------
+   subroutine gpu_download(this)
+       class(gpu_state_t), intent(inout) :: this
+
+       if (.not. this%initialized) return
+       if (.not. this%fields_on_device) return
+
+       this%Ex = this%Ex_d
+       this%Ey = this%Ey_d
+       this%Ez = this%Ez_d
+       this%Hx = this%Hx_d
+       this%Hy = this%Hy_d
+       this%Hz = this%Hz_d
+       this%fields_on_device = .false.
+       this%last_download_step = 0
+
+    end subroutine gpu_download
+
+   !--------------------------------------------------------------------------------
+   ! Destroy GPU state - called at simulation end
+   !--------------------------------------------------------------------------------
+   subroutine gpu_destroy(this)
+      class(gpu_state_t), intent(inout) :: this
+
+      if (.not. this%initialized) return
+
+      ! Deallocate device memory - YEE fields
+      if (associated(this%Ex_d)) deallocate(this%Ex_d)
+      if (associated(this%Ey_d)) deallocate(this%Ey_d)
+      if (associated(this%Ez_d)) deallocate(this%Ez_d)
+      if (associated(this%Hx_d)) deallocate(this%Hx_d)
+      if (associated(this%Hy_d)) deallocate(this%Hy_d)
+      if (associated(this%Hz_d)) deallocate(this%Hz_d)
+
+      if (associated(this%Idxe_d)) deallocate(this%Idxe_d)
+      if (associated(this%Idye_d)) deallocate(this%Idye_d)
+      if (associated(this%Idze_d)) deallocate(this%Idze_d)
+      if (associated(this%Idxh_d)) deallocate(this%Idxh_d)
+      if (associated(this%Idyh_d)) deallocate(this%Idyh_d)
+      if (associated(this%Idzh_d)) deallocate(this%Idzh_d)
+      if (associated(this%dxe_d)) deallocate(this%dxe_d)
+      if (associated(this%dye_d)) deallocate(this%dye_d)
+      if (associated(this%dze_d)) deallocate(this%dze_d)
+      if (associated(this%dxh_d)) deallocate(this%dxh_d)
+      if (associated(this%dyh_d)) deallocate(this%dyh_d)
+      if (associated(this%dzh_d)) deallocate(this%dzh_d)
+
+      if (associated(this%sggMiEx_d)) deallocate(this%sggMiEx_d)
+      if (associated(this%sggMiEy_d)) deallocate(this%sggMiEy_d)
+      if (associated(this%sggMiEz_d)) deallocate(this%sggMiEz_d)
+      if (associated(this%sggMiHx_d)) deallocate(this%sggMiHx_d)
+      if (associated(this%sggMiHy_d)) deallocate(this%sggMiHy_d)
+      if (associated(this%sggMiHz_d)) deallocate(this%sggMiHz_d)
+
+      if (associated(this%g1_d)) deallocate(this%g1_d)
+      if (associated(this%g2_d)) deallocate(this%g2_d)
+      if (associated(this%gm1_d)) deallocate(this%gm1_d)
+      if (associated(this%gm2_d)) deallocate(this%gm2_d)
+
+    ! Deallocate CPML left boundary device memory
+       if (associated(this%pml_psi_Exy_left)) deallocate(this%pml_psi_Exy_left)
+       if (associated(this%pml_psi_Ezy_left)) deallocate(this%pml_psi_Ezy_left)
+       if (associated(this%pml_psi_Hxy_left)) deallocate(this%pml_psi_Hxy_left)
+       if (associated(this%pml_psi_Hzy_left)) deallocate(this%pml_psi_Hzy_left)
+       if (associated(this%pml_P_be_y_left)) deallocate(this%pml_P_be_y_left)
+       if (associated(this%pml_P_ce_y_left)) deallocate(this%pml_P_ce_y_left)
+       if (associated(this%pml_P_bm_y_left)) deallocate(this%pml_P_bm_y_left)
+       if (associated(this%pml_P_cm_y_left)) deallocate(this%pml_P_cm_y_left)
+
+       ! Deallocate CPML down boundary device memory
+       if (associated(this%pml_psi_Exz_down)) deallocate(this%pml_psi_Exz_down)
+       if (associated(this%pml_psi_Eyz_down)) deallocate(this%pml_psi_Eyz_down)
+       if (associated(this%pml_psi_Hxz_down)) deallocate(this%pml_psi_Hxz_down)
+       if (associated(this%pml_psi_Hyz_down)) deallocate(this%pml_psi_Hyz_down)
+       if (associated(this%pml_P_be_z_down)) deallocate(this%pml_P_be_z_down)
+       if (associated(this%pml_P_ce_z_down)) deallocate(this%pml_P_ce_z_down)
+       if (associated(this%pml_P_bm_z_down)) deallocate(this%pml_P_bm_z_down)
+       if (associated(this%pml_P_cm_z_down)) deallocate(this%pml_P_cm_z_down)
+
+       ! Deallocate CPML back boundary device memory
+       if (associated(this%pml_psi_Ezx_back)) deallocate(this%pml_psi_Ezx_back)
+       if (associated(this%pml_psi_Eyx_back)) deallocate(this%pml_psi_Eyx_back)
+       if (associated(this%pml_psi_Hzx_back)) deallocate(this%pml_psi_Hzx_back)
+       if (associated(this%pml_psi_Hyx_back)) deallocate(this%pml_psi_Hyx_back)
+       if (associated(this%pml_P_be_x_back)) deallocate(this%pml_P_be_x_back)
+       if (associated(this%pml_P_ce_x_back)) deallocate(this%pml_P_ce_x_back)
+       if (associated(this%pml_P_bm_x_back)) deallocate(this%pml_P_bm_x_back)
+       if (associated(this%pml_P_cm_x_back)) deallocate(this%pml_P_cm_x_back)
+
+        ! Deallocate MUR device memory
+        if (associated(this%mur_left_CAB1)) deallocate(this%mur_left_CAB1)
+        if (associated(this%mur_left_CAB3)) deallocate(this%mur_left_CAB3)
+        if (associated(this%mur_left_cab4)) deallocate(this%mur_left_cab4)
+        if (associated(this%mur_right_CAB1)) deallocate(this%mur_right_CAB1)
+        if (associated(this%mur_right_CAB3)) deallocate(this%mur_right_CAB3)
+        if (associated(this%mur_right_cab4)) deallocate(this%mur_right_cab4)
+        if (associated(this%mur_down_CAB1)) deallocate(this%mur_down_CAB1)
+        if (associated(this%mur_down_CAB3)) deallocate(this%mur_down_CAB3)
+        if (associated(this%mur_down_cab4)) deallocate(this%mur_down_cab4)
+        if (associated(this%mur_up_CAB1)) deallocate(this%mur_up_CAB1)
+        if (associated(this%mur_up_CAB3)) deallocate(this%mur_up_CAB3)
+        if (associated(this%mur_up_cab4)) deallocate(this%mur_up_cab4)
+        if (associated(this%mur_back_CAB1)) deallocate(this%mur_back_CAB1)
+        if (associated(this%mur_back_CAB3)) deallocate(this%mur_back_CAB3)
+        if (associated(this%mur_back_cab4)) deallocate(this%mur_back_cab4)
+        if (associated(this%mur_front_CAB1)) deallocate(this%mur_front_CAB1)
+        if (associated(this%mur_front_CAB3)) deallocate(this%mur_front_CAB3)
+        if (associated(this%mur_front_cab4)) deallocate(this%mur_front_cab4)
+        if (associated(this%mur_past_Hx_left)) deallocate(this%mur_past_Hx_left)
+        if (associated(this%mur_past_Hz_left)) deallocate(this%mur_past_Hz_left)
+        if (associated(this%mur_past_Hx_right)) deallocate(this%mur_past_Hx_right)
+        if (associated(this%mur_past_Hz_right)) deallocate(this%mur_past_Hz_right)
+        if (associated(this%mur_past_Hy_down)) deallocate(this%mur_past_Hy_down)
+        if (associated(this%mur_past_Hx_down)) deallocate(this%mur_past_Hx_down)
+        if (associated(this%mur_past_Hy_up)) deallocate(this%mur_past_Hy_up)
+        if (associated(this%mur_past_Hx_up)) deallocate(this%mur_past_Hx_up)
+        if (associated(this%mur_past_Hz_back)) deallocate(this%mur_past_Hz_back)
+        if (associated(this%mur_past_Hy_back)) deallocate(this%mur_past_Hy_back)
+        if (associated(this%mur_past_Hz_front)) deallocate(this%mur_past_Hz_front)
+        if (associated(this%mur_past_Hy_front)) deallocate(this%mur_past_Hy_front)
+
+        ! Nullify host pointers
+       nullify(this%Ex); nullify(this%Ey); nullify(this%Ez)
+       nullify(this%Hx); nullify(this%Hy); nullify(this%Hz)
+       nullify(this%sggMiEx); nullify(this%sggMiEy); nullify(this%sggMiEz)
+       nullify(this%sggMiHx); nullify(this%sggMiHy); nullify(this%sggMiHz)
+       nullify(this%g1); nullify(this%g2); nullify(this%gm1); nullify(this%gm2)
+       nullify(this%Idxe); nullify(this%Idye); nullify(this%Idze)
+       nullify(this%Idxh); nullify(this%Idyh); nullify(this%Idzh)
+       nullify(this%dxe); nullify(this%dye); nullify(this%dze)
+       nullify(this%dxh); nullify(this%dyh); nullify(this%dzh)
+
+     this%initialized = .false.
+       this%fields_on_device = .false.
+       this%pml_left_initialized = .false.
+       this%pml_right_initialized = .false.
+       this%pml_down_initialized = .false.
+this%pml_up_initialized = .false.
+        this%pml_back_initialized = .false.
+        this%pml_front_initialized = .false.
+
+        ! Deallocate probe buffers
+        if (associated(this%probe_results_d)) deallocate(this%probe_results_d)
+        if (associated(this%probe_field_ids_d)) deallocate(this%probe_field_ids_d)
+        if (associated(this%probe_I_d)) deallocate(this%probe_I_d)
+        if (associated(this%probe_J_d)) deallocate(this%probe_J_d)
+        if (associated(this%probe_K_d)) deallocate(this%probe_K_d)
+        if (associated(this%block_probe_results_d)) deallocate(this%block_probe_results_d)
+        if (associated(this%block_probe_field_ids_d)) deallocate(this%block_probe_field_ids_d)
+        if (associated(this%block_probe_I1_d)) deallocate(this%block_probe_I1_d)
+        if (associated(this%block_probe_J1_d)) deallocate(this%block_probe_J1_d)
+        if (associated(this%block_probe_K1_d)) deallocate(this%block_probe_K1_d)
+        if (associated(this%block_probe_I2_d)) deallocate(this%block_probe_I2_d)
+        if (associated(this%block_probe_J2_d)) deallocate(this%block_probe_J2_d)
+     if (associated(this%block_probe_K2_d)) deallocate(this%block_probe_K2_d)
+
+      ! Deallocate flattened MUR arrays
+        if (associated(this%mur_bounds_d)) deallocate(this%mur_bounds_d)
+        if (associated(this%mur_cab1_d)) deallocate(this%mur_cab1_d)
+
+      end subroutine gpu_destroy
+
+    !--------------------------------------------------------------------------------
+    ! Update CPML down/up boundary coefficients on device - called every step
+    !--------------------------------------------------------------------------------
+    subroutine gpu_update_pml_down_coeffs(this, P_be_z, P_ce_z, P_bm_z, P_cm_z)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:), intent(in) :: P_be_z, P_ce_z, P_bm_z, P_cm_z
+
+       if (.not. this%pml_down_initialized) return
+
+       this%pml_P_be_z_down = P_be_z
+       this%pml_P_ce_z_down = P_ce_z
+       this%pml_P_bm_z_down = P_bm_z
+       this%pml_P_cm_z_down = P_cm_z
+
+    end subroutine gpu_update_pml_down_coeffs
+
+    !--------------------------------------------------------------------------------
+    ! Update CPML back/front boundary coefficients on device - called every step
+    !--------------------------------------------------------------------------------
+    subroutine gpu_update_pml_back_coeffs(this, P_be_x, P_ce_x, P_bm_x, P_cm_x)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:), intent(in) :: P_be_x, P_ce_x, P_bm_x, P_cm_x
+
+       if (.not. this%pml_back_initialized) return
+
+       this%pml_P_be_x_back = P_be_x
+       this%pml_P_ce_x_back = P_ce_x
+       this%pml_P_bm_x_back = P_bm_x
+       this%pml_P_cm_x_back = P_cm_x
+
+    end subroutine gpu_update_pml_back_coeffs
+
+    !--------------------------------------------------------------------------------
+    ! Initialize CPML down boundary on GPU
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_pml_down(this, P_be_z, P_ce_z, P_bm_z, P_cm_z, &
+                                 Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, &
+                                 Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj, &
+                                 Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, &
+                                 Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:), intent(in) :: P_be_z, P_ce_z, P_bm_z, P_cm_z
+       integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj
+       integer(kind=4), intent(in) :: Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj
+       integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj
+       integer(kind=4), intent(in) :: Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj
+
+       integer(kind=4) :: z_lo, z_hi
+
+       if (.not. this%initialized) return
+
+       this%pml_down_Ex_ii = Ex_iEx_ii + 1; this%pml_down_Ex_ij = Ex_iEx_ij + 1
+       this%pml_down_Ex_ji = Ex_iEx_ji + 1; this%pml_down_Ex_jj = Ex_iEx_jj + 1
+       this%pml_down_Ex_ki = Ex_iEx_ki + 1; this%pml_down_Ex_kj = Ex_iEx_kj + 1
+
+       this%pml_down_Ey_ii = Ex_iEy_ii + 1; this%pml_down_Ey_ij = Ex_iEy_ij + 1
+       this%pml_down_Ey_ji = Ex_iEy_ji + 1; this%pml_down_Ey_jj = Ex_iEy_jj + 1
+       this%pml_down_Ey_ki = Ex_iEy_ki + 1; this%pml_down_Ey_kj = Ex_iEy_kj + 1
+
+       this%pml_down_Hx_ii = Hx_iHx_ii + 1; this%pml_down_Hx_ij = Hx_iHx_ij + 1
+       this%pml_down_Hx_ji = Hx_iHx_ji + 1; this%pml_down_Hx_jj = Hx_iHx_jj + 1
+       this%pml_down_Hx_ki = Hx_iHx_ki + 1; this%pml_down_Hx_kj = Hx_iHx_kj + 1
+
+       this%pml_down_Hy_ii = Hx_iHy_ii + 1; this%pml_down_Hy_ij = Hx_iHy_ij + 1
+       this%pml_down_Hy_ji = Hx_iHy_ji + 1; this%pml_down_Hy_jj = Hx_iHy_jj + 1
+       this%pml_down_Hy_ki = Hx_iHy_ki + 1; this%pml_down_Hy_kj = Hx_iHy_kj + 1
+
+       this%pml_down_Exz_nx = Ex_iEx_ij - Ex_iEx_ii + 1
+       this%pml_down_Exz_ny = Ex_iEx_jj - Ex_iEx_ji + 1
+       this%pml_down_Exz_nz = Ex_iEx_kj - Ex_iEx_ki + 1
+       allocate(this%pml_psi_Exz_down(this%pml_down_Exz_nx, this%pml_down_Exz_ny, this%pml_down_Exz_nz))
+
+       this%pml_down_Eyz_nx = Ex_iEy_ij - Ex_iEy_ii + 1
+       this%pml_down_Eyz_ny = Ex_iEy_jj - Ex_iEy_ji + 1
+       this%pml_down_Eyz_nz = Ex_iEy_kj - Ex_iEy_ki + 1
+       allocate(this%pml_psi_Eyz_down(this%pml_down_Eyz_nx, this%pml_down_Eyz_ny, this%pml_down_Eyz_nz))
+
+       this%pml_down_Hxz_nx = Hx_iHx_ij - Hx_iHx_ii + 1
+       this%pml_down_Hxz_ny = Hx_iHx_jj - Hx_iHx_ji + 1
+       this%pml_down_Hxz_nz = Hx_iHx_kj - Hx_iHx_ki + 1
+       allocate(this%pml_psi_Hxz_down(this%pml_down_Hxz_nx, this%pml_down_Hxz_ny, this%pml_down_Hxz_nz))
+
+       this%pml_down_Hyz_nx = Hx_iHy_ij - Hx_iHy_ii + 1
+       this%pml_down_Hyz_ny = Hx_iHy_jj - Hx_iHy_ji + 1
+       this%pml_down_Hyz_nz = Hx_iHy_kj - Hx_iHy_ki + 1
+       allocate(this%pml_psi_Hyz_down(this%pml_down_Hyz_nx, this%pml_down_Hyz_ny, this%pml_down_Hyz_nz))
+
+       this%pml_psi_Exz_down = 0.0_rkind
+       this%pml_psi_Eyz_down = 0.0_rkind
+       this%pml_psi_Hxz_down = 0.0_rkind
+       this%pml_psi_Hyz_down = 0.0_rkind
+
+       z_lo = lbound(P_be_z, 1)
+       z_hi = ubound(P_be_z, 1)
+       this%pml_coeff_z_n = z_hi - z_lo + 1
+       allocate(this%pml_P_be_z_down(z_lo:z_hi))
+       allocate(this%pml_P_ce_z_down(z_lo:z_hi))
+       allocate(this%pml_P_bm_z_down(z_lo:z_hi))
+       allocate(this%pml_P_cm_z_down(z_lo:z_hi))
+
+       this%pml_P_be_z_down = P_be_z
+       this%pml_P_ce_z_down = P_ce_z
+       this%pml_P_bm_z_down = P_bm_z
+       this%pml_P_cm_z_down = P_cm_z
+
+       this%pml_down_initialized = .true.
+
+    end subroutine gpu_init_pml_down
+
+    !--------------------------------------------------------------------------------
+    ! Initialize CPML up boundary on GPU
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_pml_up(this, &
+                               Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj, &
+                               Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj, &
+                               Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj, &
+                               Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj)
+       class(gpu_state_t), intent(inout) :: this
+       integer(kind=4), intent(in) :: Ex_iEx_ii, Ex_iEx_ij, Ex_iEx_ji, Ex_iEx_jj, Ex_iEx_ki, Ex_iEx_kj
+       integer(kind=4), intent(in) :: Ex_iEy_ii, Ex_iEy_ij, Ex_iEy_ji, Ex_iEy_jj, Ex_iEy_ki, Ex_iEy_kj
+       integer(kind=4), intent(in) :: Hx_iHx_ii, Hx_iHx_ij, Hx_iHx_ji, Hx_iHx_jj, Hx_iHx_ki, Hx_iHx_kj
+       integer(kind=4), intent(in) :: Hx_iHy_ii, Hx_iHy_ij, Hx_iHy_ji, Hx_iHy_jj, Hx_iHy_ki, Hx_iHy_kj
+
+       if (.not. this%initialized) return
+
+       this%pml_up_Ex_ii = Ex_iEx_ii + 1; this%pml_up_Ex_ij = Ex_iEx_ij + 1
+       this%pml_up_Ex_ji = Ex_iEx_ji + 1; this%pml_up_Ex_jj = Ex_iEx_jj + 1
+       this%pml_up_Ex_ki = Ex_iEx_ki + 1; this%pml_up_Ex_kj = Ex_iEx_kj + 1
+
+       this%pml_up_Ey_ii = Ex_iEy_ii + 1; this%pml_up_Ey_ij = Ex_iEy_ij + 1
+       this%pml_up_Ey_ji = Ex_iEy_ji + 1; this%pml_up_Ey_jj = Ex_iEy_jj + 1
+       this%pml_up_Ey_ki = Ex_iEy_ki + 1; this%pml_up_Ey_kj = Ex_iEy_kj + 1
+
+       this%pml_up_Hx_ii = Hx_iHx_ii + 1; this%pml_up_Hx_ij = Hx_iHx_ij + 1
+       this%pml_up_Hx_ji = Hx_iHx_ji + 1; this%pml_up_Hx_jj = Hx_iHx_jj + 1
+       this%pml_up_Hx_ki = Hx_iHx_ki + 1; this%pml_up_Hx_kj = Hx_iHx_kj + 1
+
+       this%pml_up_Hy_ii = Hx_iHy_ii + 1; this%pml_up_Hy_ij = Hx_iHy_ij + 1
+       this%pml_up_Hy_ji = Hx_iHy_ji + 1; this%pml_up_Hy_jj = Hx_iHy_jj + 1
+       this%pml_up_Hy_ki = Hx_iHy_ki + 1; this%pml_up_Hy_kj = Hx_iHy_kj + 1
+
+       this%pml_up_initialized = .true.
+
+    end subroutine gpu_init_pml_up
+
+    !--------------------------------------------------------------------------------
+    ! Initialize CPML back boundary on GPU
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_pml_back(this, P_be_x, P_ce_x, P_bm_x, P_cm_x, &
+                                 Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj, &
+                                 Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj, &
+                                 Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj, &
+                                 Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:), intent(in) :: P_be_x, P_ce_x, P_bm_x, P_cm_x
+       integer(kind=4), intent(in) :: Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj
+       integer(kind=4), intent(in) :: Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj
+       integer(kind=4), intent(in) :: Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj
+       integer(kind=4), intent(in) :: Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj
+
+       integer(kind=4) :: x_lo, x_hi
+
+       if (.not. this%initialized) return
+
+       this%pml_back_Ez_ii = Ez_iEz_ii + 1; this%pml_back_Ez_ij = Ez_iEz_ij + 1
+       this%pml_back_Ez_ji = Ez_iEz_ji + 1; this%pml_back_Ez_jj = Ez_iEz_jj + 1
+       this%pml_back_Ez_ki = Ez_iEz_ki + 1; this%pml_back_Ez_kj = Ez_iEz_kj + 1
+
+       this%pml_back_Ey_ii = Ez_iEy_ii + 1; this%pml_back_Ey_ij = Ez_iEy_ij + 1
+       this%pml_back_Ey_ji = Ez_iEy_ji + 1; this%pml_back_Ey_jj = Ez_iEy_jj + 1
+       this%pml_back_Ey_ki = Ez_iEy_ki + 1; this%pml_back_Ey_kj = Ez_iEy_kj + 1
+
+       this%pml_back_Hz_ii = Hz_iHz_ii + 1; this%pml_back_Hz_ij = Hz_iHz_ij + 1
+       this%pml_back_Hz_ji = Hz_iHz_ji + 1; this%pml_back_Hz_jj = Hz_iHz_jj + 1
+       this%pml_back_Hz_ki = Hz_iHz_ki + 1; this%pml_back_Hz_kj = Hz_iHz_kj + 1
+
+       this%pml_back_Hy_ii = Hz_iHy_ii + 1; this%pml_back_Hy_ij = Hz_iHy_ij + 1
+       this%pml_back_Hy_ji = Hz_iHy_ji + 1; this%pml_back_Hy_jj = Hz_iHy_jj + 1
+       this%pml_back_Hy_ki = Hz_iHy_ki + 1; this%pml_back_Hy_kj = Hz_iHy_kj + 1
+
+       this%pml_back_Ezx_nx = Ez_iEz_ij - Ez_iEz_ii + 1
+       this%pml_back_Ezx_ny = Ez_iEz_jj - Ez_iEz_ji + 1
+       this%pml_back_Ezx_nz = Ez_iEz_kj - Ez_iEz_ki + 1
+       allocate(this%pml_psi_Ezx_back(this%pml_back_Ezx_nx, this%pml_back_Ezx_ny, this%pml_back_Ezx_nz))
+
+       this%pml_back_Eyx_nx = Ez_iEy_ij - Ez_iEy_ii + 1
+       this%pml_back_Eyx_ny = Ez_iEy_jj - Ez_iEy_ji + 1
+       this%pml_back_Eyx_nz = Ez_iEy_kj - Ez_iEy_ki + 1
+       allocate(this%pml_psi_Eyx_back(this%pml_back_Eyx_nx, this%pml_back_Eyx_ny, this%pml_back_Eyx_nz))
+
+       this%pml_back_Hzx_nx = Hz_iHz_ij - Hz_iHz_ii + 1
+       this%pml_back_Hzx_ny = Hz_iHz_jj - Hz_iHz_ji + 1
+       this%pml_back_Hzx_nz = Hz_iHz_kj - Hz_iHz_ki + 1
+       allocate(this%pml_psi_Hzx_back(this%pml_back_Hzx_nx, this%pml_back_Hzx_ny, this%pml_back_Hzx_nz))
+
+       this%pml_back_Hyx_nx = Hz_iHy_ij - Hz_iHy_ii + 1
+       this%pml_back_Hyx_ny = Hz_iHy_jj - Hz_iHy_ji + 1
+       this%pml_back_Hyx_nz = Hz_iHy_kj - Hz_iHy_ki + 1
+       allocate(this%pml_psi_Hyx_back(this%pml_back_Hyx_nx, this%pml_back_Hyx_ny, this%pml_back_Hyx_nz))
+
+       this%pml_psi_Ezx_back = 0.0_rkind
+       this%pml_psi_Eyx_back = 0.0_rkind
+       this%pml_psi_Hzx_back = 0.0_rkind
+       this%pml_psi_Hyx_back = 0.0_rkind
+
+       x_lo = lbound(P_be_x, 1)
+       x_hi = ubound(P_be_x, 1)
+       this%pml_coeff_x_n = x_hi - x_lo + 1
+       allocate(this%pml_P_be_x_back(x_lo:x_hi))
+       allocate(this%pml_P_ce_x_back(x_lo:x_hi))
+       allocate(this%pml_P_bm_x_back(x_lo:x_hi))
+       allocate(this%pml_P_cm_x_back(x_lo:x_hi))
+
+       this%pml_P_be_x_back = P_be_x
+       this%pml_P_ce_x_back = P_ce_x
+       this%pml_P_bm_x_back = P_bm_x
+       this%pml_P_cm_x_back = P_cm_x
+
+       this%pml_back_initialized = .true.
+
+    end subroutine gpu_init_pml_back
+
+    !--------------------------------------------------------------------------------
+    ! Initialize CPML front boundary on GPU
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_pml_front(this, &
+                                  Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj, &
+                                  Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj, &
+                                  Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj, &
+                                  Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj)
+       class(gpu_state_t), intent(inout) :: this
+       integer(kind=4), intent(in) :: Ez_iEz_ii, Ez_iEz_ij, Ez_iEz_ji, Ez_iEz_jj, Ez_iEz_ki, Ez_iEz_kj
+       integer(kind=4), intent(in) :: Ez_iEy_ii, Ez_iEy_ij, Ez_iEy_ji, Ez_iEy_jj, Ez_iEy_ki, Ez_iEy_kj
+       integer(kind=4), intent(in) :: Hz_iHz_ii, Hz_iHz_ij, Hz_iHz_ji, Hz_iHz_jj, Hz_iHz_ki, Hz_iHz_kj
+       integer(kind=4), intent(in) :: Hz_iHy_ii, Hz_iHy_ij, Hz_iHy_ji, Hz_iHy_jj, Hz_iHy_ki, Hz_iHy_kj
+
+       if (.not. this%initialized) return
+
+       this%pml_front_Ez_ii = Ez_iEz_ii + 1; this%pml_front_Ez_ij = Ez_iEz_ij + 1
+       this%pml_front_Ez_ji = Ez_iEz_ji + 1; this%pml_front_Ez_jj = Ez_iEz_jj + 1
+       this%pml_front_Ez_ki = Ez_iEz_ki + 1; this%pml_front_Ez_kj = Ez_iEz_kj + 1
+
+       this%pml_front_Ey_ii = Ez_iEy_ii + 1; this%pml_front_Ey_ij = Ez_iEy_ij + 1
+       this%pml_front_Ey_ji = Ez_iEy_ji + 1; this%pml_front_Ey_jj = Ez_iEy_jj + 1
+       this%pml_front_Ey_ki = Ez_iEy_ki + 1; this%pml_front_Ey_kj = Ez_iEy_kj + 1
+
+       this%pml_front_Hz_ii = Hz_iHz_ii + 1; this%pml_front_Hz_ij = Hz_iHz_ij + 1
+       this%pml_front_Hz_ji = Hz_iHz_ji + 1; this%pml_front_Hz_jj = Hz_iHz_jj + 1
+       this%pml_front_Hz_ki = Hz_iHz_ki + 1; this%pml_front_Hz_kj = Hz_iHz_kj + 1
+
+       this%pml_front_Hy_ii = Hz_iHy_ii + 1; this%pml_front_Hy_ij = Hz_iHy_ij + 1
+       this%pml_front_Hy_ji = Hz_iHy_ji + 1; this%pml_front_Hy_jj = Hz_iHy_jj + 1
+       this%pml_front_Hy_ki = Hz_iHy_ki + 1; this%pml_front_Hy_kj = Hz_iHy_kj + 1
+
+       this%pml_front_initialized = .true.
+
+    end subroutine gpu_init_pml_front
+
+   !--------------------------------------------------------------------------------
+   ! Upload host data to device - called only when fields are modified on host
+   !--------------------------------------------------------------------------------
+  subroutine gpu_upload(this)
+       class(gpu_state_t), intent(inout) :: this
+
+       if (.not. this%initialized) return
+
+       this%Ex_d = this%Ex
+       this%Ey_d = this%Ey
+       this%Ez_d = this%Ez
+       this%Hx_d = this%Hx
+       this%Hy_d = this%Hy
+       this%Hz_d = this%Hz
+       this%fields_on_device = .true.
+       this%last_download_step = -1024
+
+    end subroutine gpu_upload
+
+   !--------------------------------------------------------------------------------
+   ! Update CPML left boundary coefficients on device - called every step
+   !--------------------------------------------------------------------------------
+   subroutine gpu_update_pml_left_coeffs(this, P_be_y, P_ce_y, P_bm_y, P_cm_y)
+      class(gpu_state_t), intent(inout) :: this
+      real(kind=rkind), dimension(:), intent(in) :: P_be_y, P_ce_y, P_bm_y, P_cm_y
+
+      if (.not. this%pml_left_initialized) return
+
+      this%pml_P_be_y_left = P_be_y
+      this%pml_P_ce_y_left = P_ce_y
+      this%pml_P_bm_y_left = P_bm_y
+      this%pml_P_cm_y_left = P_cm_y
+
+   end subroutine gpu_update_pml_left_coeffs
+
+    !--------------------------------------------------------------------------------
+    ! Initialize MUR boundary coefficients on GPU - called after InitMURBorders
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_mur_coeffs(this, numMedia, &
+                                   left_CAB1, left_CAB3, left_cab4, &
+                                   right_CAB1, right_CAB3, right_cab4, &
+                                   down_CAB1, down_CAB3, down_cab4, &
+                                   up_CAB1, up_CAB3, up_cab4, &
+                                   back_CAB1, back_CAB3, back_cab4, &
+                                   front_CAB1, front_CAB3, front_cab4)
+       class(gpu_state_t), intent(inout) :: this
+       integer(kind=4), intent(in) :: numMedia
+       real(kind=rkind), dimension(:), intent(in) :: left_CAB1, left_CAB3, left_cab4
+       real(kind=rkind), dimension(:), intent(in) :: right_CAB1, right_CAB3, right_cab4
+       real(kind=rkind), dimension(:), intent(in) :: down_CAB1, down_CAB3, down_cab4
+       real(kind=rkind), dimension(:), intent(in) :: up_CAB1, up_CAB3, up_cab4
+       real(kind=rkind), dimension(:), intent(in) :: back_CAB1, back_CAB3, back_cab4
+       real(kind=rkind), dimension(:), intent(in) :: front_CAB1, front_CAB3, front_cab4
+
+       integer(kind=4) :: lo, hi
+
+       if (.not. this%initialized) return
+
+       this%mur_numMedia = numMedia
+
+       lo = lbound(left_CAB1, 1); hi = ubound(left_CAB1, 1)
+       allocate(this%mur_left_CAB1(lo:hi)); this%mur_left_CAB1 = left_CAB1
+       allocate(this%mur_left_CAB3(lo:hi)); this%mur_left_CAB3 = left_CAB3
+       allocate(this%mur_left_cab4(lo:hi)); this%mur_left_cab4 = left_cab4
+       allocate(this%mur_right_CAB1(lo:hi)); this%mur_right_CAB1 = right_CAB1
+       allocate(this%mur_right_CAB3(lo:hi)); this%mur_right_CAB3 = right_CAB3
+       allocate(this%mur_right_cab4(lo:hi)); this%mur_right_cab4 = right_cab4
+       allocate(this%mur_down_CAB1(lo:hi)); this%mur_down_CAB1 = down_CAB1
+       allocate(this%mur_down_CAB3(lo:hi)); this%mur_down_CAB3 = down_CAB3
+       allocate(this%mur_down_cab4(lo:hi)); this%mur_down_cab4 = down_cab4
+       allocate(this%mur_up_CAB1(lo:hi)); this%mur_up_CAB1 = up_CAB1
+       allocate(this%mur_up_CAB3(lo:hi)); this%mur_up_CAB3 = up_CAB3
+       allocate(this%mur_up_cab4(lo:hi)); this%mur_up_cab4 = up_cab4
+       allocate(this%mur_back_CAB1(lo:hi)); this%mur_back_CAB1 = back_CAB1
+       allocate(this%mur_back_CAB3(lo:hi)); this%mur_back_CAB3 = back_CAB3
+       allocate(this%mur_back_cab4(lo:hi)); this%mur_back_cab4 = back_cab4
+      allocate(this%mur_front_CAB1(lo:hi)); this%mur_front_CAB1 = front_CAB1
+        allocate(this%mur_front_CAB3(lo:hi)); this%mur_front_CAB3 = front_CAB3
+        allocate(this%mur_front_cab4(lo:hi)); this%mur_front_cab4 = front_cab4
+
+     ! Flattened CAB1 array: mur_cab1_d(12 * numMedia)
+        ! Layout: cab1_d((bound_id-1)*numMedia + medio) for contiguous access
+        if (associated(this%mur_cab1_d)) deallocate(this%mur_cab1_d)
+        allocate(this%mur_cab1_d(12 * numMedia))
+        this%mur_cab1_d(1:numMedia) = left_CAB1
+        this%mur_cab1_d(numMedia+1:2*numMedia) = left_CAB1
+        this%mur_cab1_d(2*numMedia+1:3*numMedia) = right_CAB1
+        this%mur_cab1_d(3*numMedia+1:4*numMedia) = right_CAB1
+        this%mur_cab1_d(4*numMedia+1:5*numMedia) = down_CAB1
+        this%mur_cab1_d(5*numMedia+1:6*numMedia) = down_CAB1
+        this%mur_cab1_d(6*numMedia+1:7*numMedia) = up_CAB1
+        this%mur_cab1_d(7*numMedia+1:8*numMedia) = up_CAB1
+        this%mur_cab1_d(8*numMedia+1:9*numMedia) = back_CAB1
+        this%mur_cab1_d(9*numMedia+1:10*numMedia) = back_CAB1
+        this%mur_cab1_d(10*numMedia+1:11*numMedia) = front_CAB1
+        this%mur_cab1_d(11*numMedia+1:12*numMedia) = front_CAB1
+
+        this%mur_initialized = .true.
+
+    end subroutine gpu_init_mur_coeffs
+
+    !--------------------------------------------------------------------------------
+    ! Update MUR coefficients on device - called every step
+    !--------------------------------------------------------------------------------
+    subroutine gpu_update_mur_coeffs(this, &
+                                     left_CAB1, left_CAB3, left_cab4, &
+                                     right_CAB1, right_CAB3, right_cab4, &
+                                     down_CAB1, down_CAB3, down_cab4, &
+                                     up_CAB1, up_CAB3, up_cab4, &
+                                     back_CAB1, back_CAB3, back_cab4, &
+                                     front_CAB1, front_CAB3, front_cab4)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:), intent(in) :: left_CAB1, left_CAB3, left_cab4
+       real(kind=rkind), dimension(:), intent(in) :: right_CAB1, right_CAB3, right_cab4
+       real(kind=rkind), dimension(:), intent(in) :: down_CAB1, down_CAB3, down_cab4
+       real(kind=rkind), dimension(:), intent(in) :: up_CAB1, up_CAB3, up_cab4
+       real(kind=rkind), dimension(:), intent(in) :: back_CAB1, back_CAB3, back_cab4
+       real(kind=rkind), dimension(:), intent(in) :: front_CAB1, front_CAB3, front_cab4
+
+       if (.not. this%mur_initialized) return
+
+       this%mur_left_CAB1 = left_CAB1; this%mur_left_CAB3 = left_CAB3; this%mur_left_cab4 = left_cab4
+       this%mur_right_CAB1 = right_CAB1; this%mur_right_CAB3 = right_CAB3; this%mur_right_cab4 = right_cab4
+       this%mur_down_CAB1 = down_CAB1; this%mur_down_CAB3 = down_CAB3; this%mur_down_cab4 = down_cab4
+       this%mur_up_CAB1 = up_CAB1; this%mur_up_CAB3 = up_CAB3; this%mur_up_cab4 = up_cab4
+       this%mur_back_CAB1 = back_CAB1; this%mur_back_CAB3 = back_CAB3; this%mur_back_cab4 = back_cab4
+       this%mur_front_CAB1 = front_CAB1; this%mur_front_CAB3 = front_CAB3; this%mur_front_cab4 = front_cab4
+
+    end subroutine gpu_update_mur_coeffs
+
+    !--------------------------------------------------------------------------------
+    ! Initialize MUR past-field arrays on GPU - called after InitMURBorders
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_mur_past_fields(this, left_Hx_nx, left_Hx_ny, left_Hx_nz, left_Hz_nx, left_Hz_ny, left_Hz_nz, right_Hx_nx, right_Hx_ny, right_Hx_nz, right_Hz_nx, right_Hz_ny, right_Hz_nz, down_Hy_nx, down_Hy_ny, down_Hy_nz, down_Hx_nx, down_Hx_ny, down_Hx_nz, up_Hy_nx, up_Hy_ny, up_Hy_nz, up_Hx_nx, up_Hx_ny, up_Hx_nz, back_Hz_nx, back_Hz_ny, back_Hz_nz, back_Hy_nx, back_Hy_ny, back_Hy_nz, front_Hz_nx, front_Hz_ny, front_Hz_nz, front_Hy_nx, front_Hy_ny, front_Hy_nz, left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy)
+       class(gpu_state_t), intent(inout) :: this
+       integer(kind=4), intent(in) :: left_Hx_nx, left_Hx_ny, left_Hx_nz
+       integer(kind=4), intent(in) :: left_Hz_nx, left_Hz_ny, left_Hz_nz
+       integer(kind=4), intent(in) :: right_Hx_nx, right_Hx_ny, right_Hx_nz
+       integer(kind=4), intent(in) :: right_Hz_nx, right_Hz_ny, right_Hz_nz
+       integer(kind=4), intent(in) :: down_Hy_nx, down_Hy_ny, down_Hy_nz
+       integer(kind=4), intent(in) :: down_Hx_nx, down_Hx_ny, down_Hx_nz
+       integer(kind=4), intent(in) :: up_Hy_nx, up_Hy_ny, up_Hy_nz
+       integer(kind=4), intent(in) :: up_Hx_nx, up_Hx_ny, up_Hx_nz
+       integer(kind=4), intent(in) :: back_Hz_nx, back_Hz_ny, back_Hz_nz
+       integer(kind=4), intent(in) :: back_Hy_nx, back_Hy_ny, back_Hy_nz
+       integer(kind=4), intent(in) :: front_Hz_nx, front_Hz_ny, front_Hz_nz
+       integer(kind=4), intent(in) :: front_Hy_nx, front_Hy_ny, front_Hy_nz
+       real(kind=rkind), dimension(:,:,:), intent(in) :: left_Hx, left_Hz, right_Hx, right_Hz
+       real(kind=rkind), dimension(:,:,:), intent(in) :: down_Hy, down_Hx, up_Hy, up_Hx
+       real(kind=rkind), dimension(:,:,:), intent(in) :: back_Hz, back_Hy, front_Hz, front_Hy
+
+      if (.not. this%initialized) return
+
+        allocate(this%mur_past_Hx_left(lbound(left_Hx,1):ubound(left_Hx,1), lbound(left_Hx,2):ubound(left_Hx,2), lbound(left_Hx,3):ubound(left_Hx,3)))
+        allocate(this%mur_past_Hz_left(lbound(left_Hz,1):ubound(left_Hz,1), lbound(left_Hz,2):ubound(left_Hz,2), lbound(left_Hz,3):ubound(left_Hz,3)))
+        allocate(this%mur_past_Hx_right(lbound(right_Hx,1):ubound(right_Hx,1), lbound(right_Hx,2):ubound(right_Hx,2), lbound(right_Hx,3):ubound(right_Hx,3)))
+        allocate(this%mur_past_Hz_right(lbound(right_Hz,1):ubound(right_Hz,1), lbound(right_Hz,2):ubound(right_Hz,2), lbound(right_Hz,3):ubound(right_Hz,3)))
+        allocate(this%mur_past_Hy_down(lbound(down_Hy,1):ubound(down_Hy,1), lbound(down_Hy,2):ubound(down_Hy,2), lbound(down_Hy,3):ubound(down_Hy,3)))
+        allocate(this%mur_past_Hx_down(lbound(down_Hx,1):ubound(down_Hx,1), lbound(down_Hx,2):ubound(down_Hx,2), lbound(down_Hx,3):ubound(down_Hx,3)))
+        allocate(this%mur_past_Hy_up(lbound(up_Hy,1):ubound(up_Hy,1), lbound(up_Hy,2):ubound(up_Hy,2), lbound(up_Hy,3):ubound(up_Hy,3)))
+        allocate(this%mur_past_Hx_up(lbound(up_Hx,1):ubound(up_Hx,1), lbound(up_Hx,2):ubound(up_Hx,2), lbound(up_Hx,3):ubound(up_Hx,3)))
+        allocate(this%mur_past_Hz_back(lbound(back_Hz,1):ubound(back_Hz,1), lbound(back_Hz,2):ubound(back_Hz,2), lbound(back_Hz,3):ubound(back_Hz,3)))
+        allocate(this%mur_past_Hy_back(lbound(back_Hy,1):ubound(back_Hy,1), lbound(back_Hy,2):ubound(back_Hy,2), lbound(back_Hy,3):ubound(back_Hy,3)))
+        allocate(this%mur_past_Hz_front(lbound(front_Hz,1):ubound(front_Hz,1), lbound(front_Hz,2):ubound(front_Hz,2), lbound(front_Hz,3):ubound(front_Hz,3)))
+        allocate(this%mur_past_Hy_front(lbound(front_Hy,1):ubound(front_Hy,1), lbound(front_Hy,2):ubound(front_Hy,2), lbound(front_Hy,3):ubound(front_Hy,3)))
+
+       this%mur_past_Hx_left = left_Hx; this%mur_past_Hz_left = left_Hz
+       this%mur_past_Hx_right = right_Hx; this%mur_past_Hz_right = right_Hz
+       this%mur_past_Hy_down = down_Hy; this%mur_past_Hx_down = down_Hx
+       this%mur_past_Hy_up = up_Hy; this%mur_past_Hx_up = up_Hx
+       this%mur_past_Hz_back = back_Hz; this%mur_past_Hy_back = back_Hy
+       this%mur_past_Hz_front = front_Hz; this%mur_past_Hy_front = front_Hy
+
+       this%mur_initialized = .true.
+
+    end subroutine gpu_init_mur_past_fields
+
+    !--------------------------------------------------------------------------------
+    ! Upload MUR past fields to device - called every step after CPU MUR update
+    !--------------------------------------------------------------------------------
+    subroutine gpu_upload_mur_past_fields(this, left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy)
+       class(gpu_state_t), intent(inout) :: this
+       real(kind=rkind), dimension(:,:,:), intent(in) :: left_Hx, left_Hz, right_Hx, right_Hz, down_Hy, down_Hx, up_Hy, up_Hx, back_Hz, back_Hy, front_Hz, front_Hy
+
+       if (.not. this%mur_initialized) return
+
+       this%mur_past_Hx_left = left_Hx; this%mur_past_Hz_left = left_Hz
+       this%mur_past_Hx_right = right_Hx; this%mur_past_Hz_right = right_Hz
+       this%mur_past_Hy_down = down_Hy; this%mur_past_Hx_down = down_Hx
+       this%mur_past_Hy_up = up_Hy; this%mur_past_Hx_up = up_Hx
+       this%mur_past_Hz_back = back_Hz; this%mur_past_Hy_back = back_Hy
+       this%mur_past_Hz_front = front_Hz; this%mur_past_Hy_front = front_Hy
+
+    end subroutine gpu_upload_mur_past_fields
+
+     !--------------------------------------------------------------------------------
+     ! Initialize MUR boundary limits on GPU
+     !--------------------------------------------------------------------------------
+     subroutine gpu_init_mur_limits(this, &
+                                     left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj, &
+                                     left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj, &
+                                     right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj, &
+                                     right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj, &
+                                     down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj, &
+                                     down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj, &
+                                     up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj, &
+                                     up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj, &
+                                     back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj, &
+                                     back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj, &
+                                     front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj, &
+                                     front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj)
+         class(gpu_state_t), intent(inout) :: this
+         integer(kind=4), intent(in) :: left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj
+         integer(kind=4), intent(in) :: left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj
+         integer(kind=4), intent(in) :: right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj
+         integer(kind=4), intent(in) :: right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj
+         integer(kind=4), intent(in) :: down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj
+         integer(kind=4), intent(in) :: down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj
+         integer(kind=4), intent(in) :: up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj
+         integer(kind=4), intent(in) :: up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj
+         integer(kind=4), intent(in) :: back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj
+         integer(kind=4), intent(in) :: back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj
+         integer(kind=4), intent(in) :: front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj
+         integer(kind=4), intent(in) :: front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj
+
+         ! Host-side bounds array: 12 boundaries x 6 limits (ii,ij,ji,jj,ki,kj)
+         integer(kind=4) :: host_bounds(12, 6)
+         integer(kind=4) :: i, j
+
+         if (.not. this%initialized) return
+
+         ! Populate host array with boundary limits
+         ! Boundary ID mapping: 1=left-Hx, 2=left-Hz, 3=right-Hx, 4=right-Hz,
+         !                       5=down-Hy, 6=down-Hx, 7=up-Hy, 8=up-Hx,
+         !                       9=back-Hz, 10=back-Hy, 11=front-Hz, 12=front-Hy
+         host_bounds(1, :) = [left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj]
+         host_bounds(2, :) = [left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj]
+         host_bounds(3, :) = [right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj]
+         host_bounds(4, :) = [right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj]
+         host_bounds(5, :) = [down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj]
+         host_bounds(6, :) = [down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj]
+         host_bounds(7, :) = [up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj]
+         host_bounds(8, :) = [up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj]
+         host_bounds(9, :) = [back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj]
+         host_bounds(10, :) = [back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj]
+         host_bounds(11, :) = [front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj]
+         host_bounds(12, :) = [front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj]
+
+         ! Allocate and copy to device
+         if (associated(this%mur_bounds_d)) deallocate(this%mur_bounds_d)
+         allocate(this%mur_bounds_d(12, 6))
+         this%mur_bounds_d(1:12, 1:6) = host_bounds(1:12, 1:6)
+
+         ! Also populate individual members for backward compatibility with existing kernel wrappers
+          this%mur_left_Hx_ii = left_Hx_ii; this%mur_left_Hx_ij = left_Hx_ij
+          this%mur_left_Hx_ji = left_Hx_ji; this%mur_left_Hx_jj = left_Hx_jj
+          this%mur_left_Hx_ki = left_Hx_ki; this%mur_left_Hx_kj = left_Hx_kj
+          this%mur_left_Hz_ii = left_Hz_ii; this%mur_left_Hz_ij = left_Hz_ij
+          this%mur_left_Hz_ji = left_Hz_ji; this%mur_left_Hz_jj = left_Hz_jj
+          this%mur_left_Hz_ki = left_Hz_ki; this%mur_left_Hz_kj = left_Hz_kj
+          this%mur_right_Hx_ii = right_Hx_ii; this%mur_right_Hx_ij = right_Hx_ij
+          this%mur_right_Hx_ji = right_Hx_ji; this%mur_right_Hx_jj = right_Hx_jj
+          this%mur_right_Hx_ki = right_Hx_ki; this%mur_right_Hx_kj = right_Hx_kj
+          this%mur_right_Hz_ii = right_Hz_ii; this%mur_right_Hz_ij = right_Hz_ij
+          this%mur_right_Hz_ji = right_Hz_ji; this%mur_right_Hz_jj = right_Hz_jj
+          this%mur_right_Hz_ki = right_Hz_ki; this%mur_right_Hz_kj = right_Hz_kj
+          this%mur_down_Hy_ii = down_Hy_ii; this%mur_down_Hy_ij = down_Hy_ij
+          this%mur_down_Hy_ji = down_Hy_ji; this%mur_down_Hy_jj = down_Hy_jj
+          this%mur_down_Hy_ki = down_Hy_ki; this%mur_down_Hy_kj = down_Hy_kj
+          this%mur_down_Hx_ii = down_Hx_ii; this%mur_down_Hx_ij = down_Hx_ij
+          this%mur_down_Hx_ji = down_Hx_ji; this%mur_down_Hx_jj = down_Hx_jj
+          this%mur_down_Hx_ki = down_Hx_ki; this%mur_down_Hx_kj = down_Hx_kj
+          this%mur_up_Hy_ii = up_Hy_ii; this%mur_up_Hy_ij = up_Hy_ij
+          this%mur_up_Hy_ji = up_Hy_ji; this%mur_up_Hy_jj = up_Hy_jj
+          this%mur_up_Hy_ki = up_Hy_ki; this%mur_up_Hy_kj = up_Hy_kj
+          this%mur_up_Hx_ii = up_Hx_ii; this%mur_up_Hx_ij = up_Hx_ij
+          this%mur_up_Hx_ji = up_Hx_ji; this%mur_up_Hx_jj = up_Hx_jj
+          this%mur_up_Hx_ki = up_Hx_ki; this%mur_up_Hx_kj = up_Hx_kj
+          this%mur_back_Hz_ii = back_Hz_ii; this%mur_back_Hz_ij = back_Hz_ij
+          this%mur_back_Hz_ji = back_Hz_ji; this%mur_back_Hz_jj = back_Hz_jj
+          this%mur_back_Hz_ki = back_Hz_ki; this%mur_back_Hz_kj = back_Hz_kj
+          this%mur_back_Hy_ii = back_Hy_ii; this%mur_back_Hy_ij = back_Hy_ij
+          this%mur_back_Hy_ji = back_Hy_ji; this%mur_back_Hy_jj = back_Hy_jj
+          this%mur_back_Hy_ki = back_Hy_ki; this%mur_back_Hy_kj = back_Hy_kj
+          this%mur_front_Hz_ii = front_Hz_ii; this%mur_front_Hz_ij = front_Hz_ij
+          this%mur_front_Hz_ji = front_Hz_ji; this%mur_front_Hz_jj = front_Hz_jj
+          this%mur_front_Hz_ki = front_Hz_ki; this%mur_front_Hz_kj = front_Hz_kj
+          this%mur_front_Hy_ii = front_Hy_ii; this%mur_front_Hy_ij = front_Hy_ij
+          this%mur_front_Hy_ji = front_Hy_ji; this%mur_front_Hy_jj = front_Hy_jj
+          this%mur_front_Hy_ki = front_Hy_ki; this%mur_front_Hy_kj = front_Hy_kj
+
+    end subroutine gpu_init_mur_limits
+
+   end module gpu_core_m
+
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU CORE PROBE EXTENSION - Probe-aware selective download
+!  Downloads only the cells that observation probes reference,
+!  eliminating per-timestep full-field downloads.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_core_probe_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Download only probe-relevant cells from device to host
+   ! Much smaller than downloading all 6 fields
+   !--------------------------------------------------------------------------------
+   subroutine gpu_download_probes(this, sgg, Ex, Ey, Ez, Hx, Hy, Hz)
+      class(gpu_state_t), intent(inout) :: this
+      type(SGGFDTDINFO_t), intent(in) :: sgg
+      real(kind=rkind), dimension(:,:,:), pointer, intent(inout) :: Ex, Ey, Ez, Hx, Hy, Hz
+
+      integer(kind=4) :: ii, i, field
+      integer(kind=4) :: I1, J1, K1, I2, J2, K2
+      integer(kind=4) :: iii, jjj, kkk
+      integer(kind=4) :: pointObservationCases(6), blockCurrentObservationCases(6)
+      integer(kind=4) :: i1_m, i2_m, j1_m, j2_m, k1_m, k2_m
+      logical :: is_point, is_block
+
+      pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz]
+      blockCurrentObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz]
+
+      if (.not. this%initialized) return
+      if (.not. this%fields_on_device) return
+
+      ! Check if any probe has invalid bounds (0,0,0) — indicates element-based probe
+      ! Element-based probes need full field download since cell locations
+      ! are determined by element geometry, not by simple cell ranges
+      do ii = 1, sgg%NumberRequest
+         do i = 1, sgg%Observation(ii)%nP
+            if (sgg%Observation(ii)%P(i)%XI == 0 .and. &
+                sgg%Observation(ii)%P(i)%YI == 0 .and. &
+                sgg%Observation(ii)%P(i)%ZI == 0) then
+               ! Element-based probe — fall back to full download
+               this%Ex = this%Ex_d
+               this%Ey = this%Ey_d
+               this%Ez = this%Ez_d
+               this%Hx = this%Hx_d
+               this%Hy = this%Hy_d
+               this%Hz = this%Hz_d
+               this%fields_on_device = .false.
+               return
+            end if
+         end do
+      end do
+
+      ! Point probes: download individual cells
+      do ii = 1, sgg%NumberRequest
+         do i = 1, sgg%Observation(ii)%nP
+            field = sgg%Observation(ii)%P(i)%what
+            if (field == nothing) cycle
+
+            I1 = sgg%Observation(ii)%P(i)%XI
+            J1 = sgg%Observation(ii)%P(i)%YI
+            K1 = sgg%Observation(ii)%P(i)%ZI
+
+            is_point = any(field == pointObservationCases)
+            is_block = any(field == blockCurrentObservationCases)
+
+            if (is_point) then
+               ! Download single cell from each field
+               Ex(I1,J1,K1) = this%Ex_d(I1,J1,K1)
+               Ey(I1,J1,K1) = this%Ey_d(I1,J1,K1)
+               Ez(I1,J1,K1) = this%Ez_d(I1,J1,K1)
+               Hx(I1,J1,K1) = this%Hx_d(I1,J1,K1)
+               Hy(I1,J1,K1) = this%Hy_d(I1,J1,K1)
+               Hz(I1,J1,K1) = this%Hz_d(I1,J1,K1)
+            else if (is_block) then
+               ! Block probes: download the surface/volume region
+               ! Only download if bounds are valid (element-based probes may have 0 bounds)
+               i1_m = I1; i2_m = sgg%Observation(ii)%P(i)%XE
+               j1_m = J1; j2_m = sgg%Observation(ii)%P(i)%YE
+               k1_m = K1; k2_m = sgg%Observation(ii)%P(i)%ZE
+               ! Skip if bounds are invalid (element-based probes use elementIds, not cell ranges)
+               if (i1_m > 0 .and. i2_m > 0 .and. j1_m > 0 .and. j2_m > 0 .and. k1_m > 0 .and. k2_m > 0) then
+                  do kkk = k1_m, k2_m
+                     do jjj = j1_m, j2_m
+                        do iii = i1_m, i2_m
+                           Ex(iii,jjj,kkk) = this%Ex_d(iii,jjj,kkk)
+                           Ey(iii,jjj,kkk) = this%Ey_d(iii,jjj,kkk)
+                           Ez(iii,jjj,kkk) = this%Ez_d(iii,jjj,kkk)
+                           Hx(iii,jjj,kkk) = this%Hx_d(iii,jjj,kkk)
+                           Hy(iii,jjj,kkk) = this%Hy_d(iii,jjj,kkk)
+                           Hz(iii,jjj,kkk) = this%Hz_d(iii,jjj,kkk)
+                        end do
+                     end do
+                  end do
+               end if
+            end if
+         end do
+      end do
+
+end subroutine gpu_download_probes
+
+    !--------------------------------------------------------------------------------
+    ! GPU kernel: sample point probes (single cell)
+    !--------------------------------------------------------------------------------
+    attributes(global) subroutine gpu_sample_point_probes_kernel(probe_results_d, probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d, &
+                                                                 Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, numProbes)
+       real(kind=rkind), intent(out), dimension(:), device :: probe_results_d
+       integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d
+       real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d
+       integer, value :: numProbes
+       integer :: idx, fieldId, i, j, k
+       real(kind=rkind) :: val
+
+       idx = (blockidx%x - 1) * blockdim%x + threadidx%x
+       if (idx > numProbes) return
+
+       fieldId = probe_field_ids_d(idx)
+       i = probe_I_d(idx)
+       j = probe_J_d(idx)
+       k = probe_K_d(idx)
+
+       select case(fieldId)
+       case(iEx); val = Ex_d(i,j,k)
+       case(iEy); val = Ey_d(i,j,k)
+       case(iEz); val = Ez_d(i,j,k)
+       case(iHx); val = Hx_d(i,j,k)
+       case(iHy); val = Hy_d(i,j,k)
+       case(iHz); val = Hz_d(i,j,k)
+       case default; val = 0.0_rkind
+       end select
+
+       probe_results_d(idx) = val
+
+    end subroutine gpu_sample_point_probes_kernel
+
+    !--------------------------------------------------------------------------------
+    ! GPU kernel: sample block probes (summation over block)
+    ! Each thread handles one block probe
+    !--------------------------------------------------------------------------------
+    attributes(global) subroutine gpu_sample_block_probes_kernel(probe_results_d, probe_field_ids_d, &
+                                                                   probe_I1_d, probe_J1_d, probe_K1_d, &
+                                                                   probe_I2_d, probe_J2_d, probe_K2_d, &
+                                                                   Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, &
+                                                                   numBlockProbes)
+       real(kind=rkind), intent(out), dimension(:), device :: probe_results_d
+       integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d
+       integer(kind=4), intent(in), dimension(:), device :: probe_I1_d, probe_J1_d, probe_K1_d
+       integer(kind=4), intent(in), dimension(:), device :: probe_I2_d, probe_J2_d, probe_K2_d
+       real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d
+       integer, value :: numBlockProbes
+       integer :: idx, fieldId, ii, jj, kk, i1, i2, j1, j2, k1, k2
+       real(kind=rkind) :: val
+
+       idx = (blockidx%x - 1) * blockdim%x + threadidx%x
+       if (idx > numBlockProbes) return
+
+       fieldId = probe_field_ids_d(idx)
+       i1 = probe_I1_d(idx)
+       i2 = probe_I2_d(idx)
+       j1 = probe_J1_d(idx)
+       j2 = probe_J2_d(idx)
+       k1 = probe_K1_d(idx)
+       k2 = probe_K2_d(idx)
+
+       val = 0.0_rkind
+
+       select case(fieldId)
+       case(iBloqueJx)
+          do jj = j1, j2
+             val = val + (Hy_d(i1, jj, k1 - 1) - Hy_d(i1, jj, k2))
+          end do
+          do kk = k1, k2
+             val = val + (-Hz_d(i1, j1 - 1, kk) + Hz_d(i1, j2, kk))
+          end do
+       case(iBloqueJy)
+          do kk = k1, k2
+             val = val + (-Hz_d(i2, j1, kk) + Hz_d(i1 - 1, j1, kk))
+          end do
+          do ii = i1, i2
+             val = val + (Hx_d(ii, j1, k2) - Hx_d(ii, j1, k1 - 1))
+          end do
+       case(iBloqueJz)
+          do ii = i1, i2
+             val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1))
+          end do
+          do jj = j1, j2
+             val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1))
+          end do
+       case(iBloqueMx)
+          do jj = j1, j2
+             val = val + (Hz_d(i1, jj, k1 - 1) - Hz_d(i1, jj, k2))
+          end do
+          do kk = k1, k2
+             val = val + (-Hy_d(i1, j1 - 1, kk) + Hy_d(i1, j2, kk))
+          end do
+       case(iBloqueMy)
+          do kk = k1, k2
+             val = val + (-Hx_d(i2, j1, kk) + Hx_d(i1 - 1, j1, kk))
+          end do
+          do ii = i1, i2
+             val = val + (Hz_d(ii, j1, k2) - Hz_d(ii, j1, k1 - 1))
+          end do
+       case(iBloqueMz)
+          do ii = i1, i2
+             val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1))
+          end do
+          do jj = j1, j2
+             val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1))
+          end do
+       case default
+          val = 0.0_rkind
+       end select
+
+       probe_results_d(idx) = val
+
+    end subroutine gpu_sample_block_probes_kernel
+
+     !--------------------------------------------------------------------------------
+     ! Fused probe sampling kernel — point + block probes in single launch
+     !--------------------------------------------------------------------------------
+     attributes(global) subroutine gpu_sample_all_probes_kernel(probe_results_d, block_probe_results_d, &
+                                                                  probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d, &
+                                                                  block_probe_field_ids_d, &
+                                                                  block_probe_I1_d, block_probe_J1_d, block_probe_K1_d, &
+                                                                  block_probe_I2_d, block_probe_J2_d, block_probe_K2_d, &
+                                                                  Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, &
+                                                                  pointCount, blockCount)
+        real(kind=rkind), intent(out), dimension(:), device :: probe_results_d, block_probe_results_d
+        integer(kind=4), intent(in), dimension(:), device :: probe_field_ids_d, probe_I_d, probe_J_d, probe_K_d
+        integer(kind=4), intent(in), dimension(:), device :: block_probe_field_ids_d
+        integer(kind=4), intent(in), dimension(:), device :: block_probe_I1_d, block_probe_J1_d, block_probe_K1_d
+        integer(kind=4), intent(in), dimension(:), device :: block_probe_I2_d, block_probe_J2_d, block_probe_K2_d
+        real(kind=rkind), dimension(:,:,:), device :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d
+        integer, value :: pointCount, blockCount
+        integer :: idx, fieldId, i, j, k, ii, jj, kk, i1, i2, j1, j2, k1, k2
+        real(kind=rkind) :: val
+
+        idx = (blockidx%x - 1) * blockdim%x + threadidx%x
+
+        ! Point probes (first pointCount threads)
+        if (idx <= pointCount) then
+           fieldId = probe_field_ids_d(idx)
+           i = probe_I_d(idx)
+           j = probe_J_d(idx)
+           k = probe_K_d(idx)
+           select case(fieldId)
+           case(iEx); val = Ex_d(i,j,k)
+           case(iEy); val = Ey_d(i,j,k)
+           case(iEz); val = Ez_d(i,j,k)
+           case(iHx); val = Hx_d(i,j,k)
+           case(iHy); val = Hy_d(i,j,k)
+           case(iHz); val = Hz_d(i,j,k)
+           case default; val = 0.0_rkind
+           end select
+           probe_results_d(idx) = val
+           return
+        end if
+
+        ! Block probes (remaining threads)
+        idx = idx - pointCount
+        if (idx > blockCount) return
+
+        fieldId = block_probe_field_ids_d(idx)
+        i1 = block_probe_I1_d(idx)
+        i2 = block_probe_I2_d(idx)
+        j1 = block_probe_J1_d(idx)
+        j2 = block_probe_J2_d(idx)
+        k1 = block_probe_K1_d(idx)
+        k2 = block_probe_K2_d(idx)
+
+        val = 0.0_rkind
+
+        select case(fieldId)
+        case(iBloqueJx)
+           do jj = j1, j2
+              val = val + (Hy_d(i1, jj, k1 - 1) - Hy_d(i1, jj, k2))
+           end do
+           do kk = k1, k2
+              val = val + (-Hz_d(i1, j1 - 1, kk) + Hz_d(i1, j2, kk))
+           end do
+        case(iBloqueJy)
+           do kk = k1, k2
+              val = val + (-Hz_d(i2, j1, kk) + Hz_d(i1 - 1, j1, kk))
+           end do
+           do ii = i1, i2
+              val = val + (Hx_d(ii, j1, k2) - Hx_d(ii, j1, k1 - 1))
+           end do
+        case(iBloqueJz)
+           do ii = i1, i2
+              val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1))
+           end do
+           do jj = j1, j2
+              val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1))
+           end do
+        case(iBloqueMx)
+           do jj = j1, j2
+              val = val + (Hz_d(i1, jj, k1 - 1) - Hz_d(i1, jj, k2))
+           end do
+           do kk = k1, k2
+              val = val + (-Hy_d(i1, j1 - 1, kk) + Hy_d(i1, j2, kk))
+           end do
+        case(iBloqueMy)
+           do kk = k1, k2
+              val = val + (-Hx_d(i2, j1, kk) + Hx_d(i1 - 1, j1, kk))
+           end do
+           do ii = i1, i2
+              val = val + (Hz_d(ii, j1, k2) - Hz_d(ii, j1, k1 - 1))
+           end do
+        case(iBloqueMz)
+           do ii = i1, i2
+              val = val + (Hy_d(ii, j2, k1) - Hy_d(ii, j1, k1))
+           end do
+           do jj = j1, j2
+              val = val + (-Hx_d(i1 - 1, jj, k1) + Hx_d(i2, jj, k1))
+           end do
+        case default
+           val = 0.0_rkind
+        end select
+
+        block_probe_results_d(idx) = val
+
+     end subroutine gpu_sample_all_probes_kernel
+
+     !--------------------------------------------------------------------------------
+    ! Initialize probe buffers on device
+    !--------------------------------------------------------------------------------
+    subroutine gpu_init_probe_buffers(this, sgg)
+       class(gpu_state_t), intent(inout) :: this
+       type(SGGFDTDINFO_t), intent(in) :: sgg
+
+       integer(kind=4) :: ii, i, nProbes, pointCount, blockCount
+       integer(kind=4) :: pointObservationCases(6)
+       integer(kind=4) :: blockObservationCases(6)
+       integer(kind=4) :: cuda_status
+
+       pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz]
+       blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz]
+       pointCount = 0
+       blockCount = 0
+
+       ! Count probes by type
+       do ii = 1, sgg%NumberRequest
+          if (.not. sgg%Observation(ii)%TimeDomain) cycle
+          nProbes = sgg%Observation(ii)%nP
+          do i = 1, nProbes
+             if (sgg%Observation(ii)%P(i)%what == nothing) cycle
+             if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then
+                pointCount = pointCount + 1
+             else if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then
+                blockCount = blockCount + 1
+             end if
+          end do
+       end do
+
+       if (pointCount == 0 .and. blockCount == 0) return
+
+       this%num_probe_results = pointCount
+       this%num_block_probe_results = blockCount
+
+       ! Allocate device buffers for point probes
+       if (pointCount > 0) then
+          allocate(this%probe_field_ids_d(pointCount))
+          allocate(this%probe_I_d(pointCount))
+          allocate(this%probe_J_d(pointCount))
+          allocate(this%probe_K_d(pointCount))
+          allocate(this%probe_results_d(pointCount))
+          cuda_status = cudaMemcpy(this%probe_results_d, 0.0_rkind, pointCount * 4, cudaMemcpyHostToDevice)
+       end if
+
+       ! Allocate device buffers for block probes
+       if (blockCount > 0) then
+          allocate(this%block_probe_field_ids_d(blockCount))
+          allocate(this%block_probe_I1_d(blockCount))
+          allocate(this%block_probe_J1_d(blockCount))
+          allocate(this%block_probe_K1_d(blockCount))
+          allocate(this%block_probe_I2_d(blockCount))
+          allocate(this%block_probe_J2_d(blockCount))
+          allocate(this%block_probe_K2_d(blockCount))
+          allocate(this%block_probe_results_d(blockCount))
+          cuda_status = cudaMemcpy(this%block_probe_results_d, 0.0_rkind, blockCount * 4, cudaMemcpyHostToDevice)
+       end if
+
+       ! Populate point probe metadata
+       if (pointCount > 0) then
+          call gpu_populate_point_probes(this, sgg)
+       end if
+
+       ! Populate block probe metadata
+       if (blockCount > 0) then
+          call gpu_populate_block_probes(this, sgg)
+       end if
+
+       this%probe_buffers_initialized = .true.
+
+    end subroutine gpu_init_probe_buffers
+
+    !--------------------------------------------------------------------------------
+    ! Populate point probe metadata on device
+    !--------------------------------------------------------------------------------
+    subroutine gpu_populate_point_probes(this, sgg)
+       class(gpu_state_t), intent(inout) :: this
+       type(SGGFDTDINFO_t), intent(in) :: sgg
+
+       integer(kind=4) :: ii, i, idx, nProbes
+       integer(kind=4) :: pointObservationCases(6)
+       integer(kind=4), allocatable, dimension(:) :: tmp_field_ids, tmp_I, tmp_J, tmp_K
+       integer(kind=4) :: cuda_status
+
+       pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz]
+       idx = 0
+       allocate(tmp_field_ids(this%num_probe_results))
+       allocate(tmp_I(this%num_probe_results))
+       allocate(tmp_J(this%num_probe_results))
+       allocate(tmp_K(this%num_probe_results))
+
+       do ii = 1, sgg%NumberRequest
+          if (.not. sgg%Observation(ii)%TimeDomain) cycle
+          nProbes = sgg%Observation(ii)%nP
+          do i = 1, nProbes
+             if (sgg%Observation(ii)%P(i)%what == nothing) cycle
+             if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then
+                idx = idx + 1
+                tmp_field_ids(idx) = sgg%Observation(ii)%P(i)%what
+                tmp_I(idx) = sgg%Observation(ii)%P(i)%XI
+                tmp_J(idx) = sgg%Observation(ii)%P(i)%YI
+                tmp_K(idx) = sgg%Observation(ii)%P(i)%ZI
+             end if
+          end do
+       end do
+
+       cuda_status = cudaMemcpy(this%probe_field_ids_d, tmp_field_ids, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%probe_I_d, tmp_I, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%probe_J_d, tmp_J, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%probe_K_d, tmp_K, idx * 4, cudaMemcpyHostToDevice)
+
+       deallocate(tmp_field_ids, tmp_I, tmp_J, tmp_K)
+
+    end subroutine gpu_populate_point_probes
+
+    !--------------------------------------------------------------------------------
+    ! Populate block probe metadata on device
+    !--------------------------------------------------------------------------------
+    subroutine gpu_populate_block_probes(this, sgg)
+       class(gpu_state_t), intent(inout) :: this
+       type(SGGFDTDINFO_t), intent(in) :: sgg
+
+       integer(kind=4) :: ii, i, idx, nProbes
+       integer(kind=4) :: blockObservationCases(6)
+       integer(kind=4), allocatable, dimension(:) :: tmp_field_ids, tmp_I1, tmp_J1, tmp_K1
+       integer(kind=4), allocatable, dimension(:) :: tmp_I2, tmp_J2, tmp_K2
+       integer(kind=4) :: cuda_status
+
+       blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz]
+       idx = 0
+       allocate(tmp_field_ids(this%num_block_probe_results))
+       allocate(tmp_I1(this%num_block_probe_results))
+       allocate(tmp_J1(this%num_block_probe_results))
+       allocate(tmp_K1(this%num_block_probe_results))
+       allocate(tmp_I2(this%num_block_probe_results))
+       allocate(tmp_J2(this%num_block_probe_results))
+       allocate(tmp_K2(this%num_block_probe_results))
+
+       do ii = 1, sgg%NumberRequest
+          if (.not. sgg%Observation(ii)%TimeDomain) cycle
+          nProbes = sgg%Observation(ii)%nP
+          do i = 1, nProbes
+             if (sgg%Observation(ii)%P(i)%what == nothing) cycle
+             if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then
+                idx = idx + 1
+                tmp_field_ids(idx) = sgg%Observation(ii)%P(i)%what
+                tmp_I1(idx) = sgg%Observation(ii)%P(i)%XI
+                tmp_J1(idx) = sgg%Observation(ii)%P(i)%YI
+                tmp_K1(idx) = sgg%Observation(ii)%P(i)%ZI
+                tmp_I2(idx) = sgg%Observation(ii)%P(i)%XE
+                tmp_J2(idx) = sgg%Observation(ii)%P(i)%YE
+                tmp_K2(idx) = sgg%Observation(ii)%P(i)%ZE
+             end if
+          end do
+       end do
+
+       cuda_status = cudaMemcpy(this%block_probe_field_ids_d, tmp_field_ids, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_I1_d, tmp_I1, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_J1_d, tmp_J1, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_K1_d, tmp_K1, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_I2_d, tmp_I2, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_J2_d, tmp_J2, idx * 4, cudaMemcpyHostToDevice)
+       cuda_status = cudaMemcpy(this%block_probe_K2_d, tmp_K2, idx * 4, cudaMemcpyHostToDevice)
+
+       deallocate(tmp_field_ids, tmp_I1, tmp_J1, tmp_K1, tmp_I2, tmp_J2, tmp_K2)
+
+    end subroutine gpu_populate_block_probes
+
+   !--------------------------------------------------------------------------------
+     ! Sample point probes on GPU (public interface)
+     !--------------------------------------------------------------------------------
+     subroutine gpu_sample_point_probes(this, results_h, nTime)
+        class(gpu_state_t), intent(inout) :: this
+        real(kind=rkind), dimension(:), intent(out) :: results_h
+        integer(kind=4), intent(in) :: nTime
+
+        integer(kind=4) :: pointCount, blockSize, gridSize
+        integer(kind=4) :: cuda_status
+
+        pointCount = this%num_probe_results
+        if (pointCount == 0) return
+
+        blockSize = 256
+        gridSize = (pointCount + blockSize - 1) / blockSize
+        call gpu_sample_point_probes_kernel<<<gridSize, blockSize>>>( &
+           this%probe_results_d, this%probe_field_ids_d, this%probe_I_d, this%probe_J_d, this%probe_K_d, &
+           this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, pointCount)
+        cuda_status = cudaMemcpy(results_h, this%probe_results_d, pointCount * 4, cudaMemcpyDeviceToHost)
+
+     end subroutine gpu_sample_point_probes
+
+     !--------------------------------------------------------------------------------
+     ! Sample block probes on GPU (public interface)
+     !--------------------------------------------------------------------------------
+     subroutine gpu_sample_block_probes(this, results_h, nTime)
+        class(gpu_state_t), intent(inout) :: this
+        real(kind=rkind), dimension(:), intent(out) :: results_h
+        integer(kind=4), intent(in) :: nTime
+
+        integer(kind=4) :: blockCount, blockSize, gridSize
+        integer(kind=4) :: cuda_status
+
+        blockCount = this%num_block_probe_results
+        if (blockCount == 0) return
+
+        blockSize = 256
+        gridSize = (blockCount + blockSize - 1) / blockSize
+        call gpu_sample_block_probes_kernel<<<gridSize, blockSize>>>( &
+           this%block_probe_results_d, this%block_probe_field_ids_d, &
+           this%block_probe_I1_d, this%block_probe_J1_d, this%block_probe_K1_d, &
+           this%block_probe_I2_d, this%block_probe_J2_d, this%block_probe_K2_d, &
+           this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, &
+           blockCount)
+        cuda_status = cudaMemcpy(results_h, this%block_probe_results_d, blockCount * 4, cudaMemcpyDeviceToHost)
+
+     end subroutine gpu_sample_block_probes
+
+     !--------------------------------------------------------------------------------
+     ! Fused probe sampling — point + block in single kernel launch
+     !--------------------------------------------------------------------------------
+     subroutine gpu_sample_all_probes(this, point_results_h, block_results_h, nTime)
+        class(gpu_state_t), intent(inout) :: this
+        real(kind=rkind), dimension(:), intent(out) :: point_results_h
+        real(kind=rkind), dimension(:), intent(out) :: block_results_h
+        integer(kind=4), intent(in) :: nTime
+
+        integer(kind=4) :: pointCount, blockCount, totalProbes, blockSize, gridSize
+        integer(kind=4) :: cuda_status
+
+        pointCount = this%num_probe_results
+        blockCount = this%num_block_probe_results
+        totalProbes = pointCount + blockCount
+
+        if (totalProbes == 0) return
+
+        blockSize = 256
+        gridSize = (totalProbes + blockSize - 1) / blockSize
+
+        call gpu_sample_all_probes_kernel<<<gridSize, blockSize>>>( &
+           this%probe_results_d, this%block_probe_results_d, &
+           this%probe_field_ids_d, this%probe_I_d, this%probe_J_d, this%probe_K_d, &
+           this%block_probe_field_ids_d, &
+           this%block_probe_I1_d, this%block_probe_J1_d, this%block_probe_K1_d, &
+           this%block_probe_I2_d, this%block_probe_J2_d, this%block_probe_K2_d, &
+           this%Ex_d, this%Ey_d, this%Ez_d, this%Hx_d, this%Hy_d, this%Hz_d, &
+           pointCount, blockCount)
+
+        if (pointCount > 0) then
+           cuda_status = cudaMemcpy(point_results_h, this%probe_results_d, pointCount * 4, cudaMemcpyDeviceToHost)
+        endif
+        if (blockCount > 0) then
+           cuda_status = cudaMemcpy(block_results_h, this%block_probe_results_d, blockCount * 4, cudaMemcpyDeviceToHost)
+        endif
+
+     end subroutine gpu_sample_all_probes
+
+    !--------------------------------------------------------------------------------
+    ! Destroy GPU probe buffers
+    !--------------------------------------------------------------------------------
+    subroutine gpu_destroy_probe_buffers(this)
+       class(gpu_state_t), intent(inout) :: this
+
+       if (this%probe_buffers_initialized) then
+          if (associated(this%probe_results_d)) deallocate(this%probe_results_d)
+          if (associated(this%probe_field_ids_d)) deallocate(this%probe_field_ids_d)
+          if (associated(this%probe_I_d)) deallocate(this%probe_I_d)
+          if (associated(this%probe_J_d)) deallocate(this%probe_J_d)
+          if (associated(this%probe_K_d)) deallocate(this%probe_K_d)
+          if (associated(this%block_probe_results_d)) deallocate(this%block_probe_results_d)
+          if (associated(this%block_probe_field_ids_d)) deallocate(this%block_probe_field_ids_d)
+          if (associated(this%block_probe_I1_d)) deallocate(this%block_probe_I1_d)
+          if (associated(this%block_probe_J1_d)) deallocate(this%block_probe_J1_d)
+          if (associated(this%block_probe_K1_d)) deallocate(this%block_probe_K1_d)
+          if (associated(this%block_probe_I2_d)) deallocate(this%block_probe_I2_d)
+          if (associated(this%block_probe_J2_d)) deallocate(this%block_probe_J2_d)
+          if (associated(this%block_probe_K2_d)) deallocate(this%block_probe_K2_d)
+          this%probe_buffers_initialized = .false.
+       end if
+
+    end subroutine gpu_destroy_probe_buffers
+
+     !--------------------------------------------------------------------------------
+     ! Initialize NF2FF (near-to-far-field) buffers on GPU
+     !--------------------------------------------------------------------------------
+     subroutine gpu_init_nf2ff_buffers(this, ExIz, ExDe, ExAb, ExAr, EyFr, EyTr, EyAb, EyAr, EzIz, EzDe, EzFr, EzTr, &
+                                        HxIz, HxDe, HxAb, HxAr, HyFr, HyTr, HyAb, HyAr, HzIz, HzDe, HzFr, HzTr, &
+                                        HxIz2, HxDe2, HxAb2, HxAr2, HyFr2, HyTr2, HyAb2, HyAr2, HzIz2, HzDe2, HzFr2, HzTr2, &
+                                        expIwdt, auxExp_E, auxExp_H, &
+                                        phys_x_Mx, phys_y_Mx, phys_z_Mx, &
+                                        phys_x_My, phys_y_My, phys_z_My, &
+                                        phys_x_Mz, phys_y_Mz, phys_z_Mz, &
+                                        phys_x_Jx, phys_y_Jx, phys_z_Jx, &
+                                        phys_x_Jy, phys_y_Jy, phys_z_Jy, &
+                                        phys_x_Jz, phys_y_Jz, phys_z_Jz, &
+                                        dyh_in, dye_in, dze_in, dzh_in, &
+                                         numCells, numFreqs, Ntheta, Nphi, &
+                                         thetaStart, thetaStop, thetaStep, phiStart, phiStop, phiStep, &
+                                         freqStep, initialFreq, cluz, z0, XDobleAncho, YDobleAncho, ZDobleAncho, sym_flags)
+        class(gpu_state_t), intent(inout) :: this
+        complex(kind=rkind), dimension(:,:,:), intent(in) :: ExIz, ExDe, ExAb, ExAr, EyFr, EyTr, EyAb, EyAr, EzIz, EzDe, EzFr, EzTr
+        complex(kind=rkind), dimension(:,:,:), intent(in) :: HxIz, HxDe, HxAb, HxAr, HyFr, HyTr, HyAb, HyAr, HzIz, HzDe, HzFr, HzTr
+        complex(kind=rkind), dimension(:,:,:), intent(in) :: HxIz2, HxDe2, HxAb2, HxAr2, HyFr2, HyTr2, HyAb2, HyAr2, HzIz2, HzDe2, HzFr2, HzTr2
+        complex(kind=rkind), dimension(:), intent(in) :: expIwdt, auxExp_E, auxExp_H
+        real(kind=rkind), dimension(:), intent(in) :: phys_x_Mx, phys_y_Mx, phys_z_Mx
+        real(kind=rkind), dimension(:), intent(in) :: phys_x_My, phys_y_My, phys_z_My
+        real(kind=rkind), dimension(:), intent(in) :: phys_x_Mz, phys_y_Mz, phys_z_Mz
+        real(kind=rkind), dimension(:), intent(in) :: phys_x_Jx, phys_y_Jx, phys_z_Jx
+        real(kind=rkind), dimension(:), intent(in) :: phys_x_Jy, phys_y_Jy, phys_z_Jy
+     real(kind=rkind), dimension(:), intent(in) :: phys_x_Jz, phys_y_Jz, phys_z_Jz
+         integer(kind=4), intent(in) :: numCells, numFreqs, Ntheta, Nphi
+        real(kind=rkind), intent(in) :: thetaStart, thetaStop, thetaStep, phiStart, phiStop, phiStep
+        real(kind=rkind), intent(in) :: freqStep, initialFreq, cluz, z0
+        real(kind=rkind), intent(in) :: XDobleAncho, YDobleAncho, ZDobleAncho
+        integer(kind=4), intent(in) :: sym_flags
+        real(kind=rkind), dimension(:), intent(in) :: dyh_in, dye_in, dze_in, dzh_in
+
+        integer(kind=4) :: nx, ny, nz, cuda_status
+
+        if (.not. this%initialized) return
+        if (this%nf2ff_initialized) return
+
+        ! Determine buffer dimensions from first array
+        nx = ubound(ExIz, 1) - lbound(ExIz, 1) + 1
+        ny = ubound(ExIz, 2) - lbound(ExIz, 2) + 1
+        nz = ubound(ExIz, 3) - lbound(ExIz, 3) + 1
+
+        ! Allocate NF2FF DFT buffers (18 complex arrays, 3D)
+        if (numCells > 0 .and. numFreqs > 0) then
+           allocate(this%nf2ff_ExIz_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_ExDe_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_ExAb_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_ExAr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EyFr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EyTr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EyAb_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EyAr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EzIz_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EzDe_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EzFr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_EzTr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxIz_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxDe_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxAb_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxAr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyFr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyTr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyAb_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyAr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzIz_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzDe_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzFr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzTr_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxIz2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxDe2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxAb2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HxAr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyFr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyTr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyAb2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HyAr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzIz2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzDe2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzFr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+           allocate(this%nf2ff_HzTr2_d(0:nx-1, 0:ny-1, 0:nz-1))
+
+           ! Copy DFT buffers from host to device
+           cuda_status = cudaMemcpy(this%nf2ff_ExIz_d, ExIz, size(ExIz)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_ExDe_d, ExDe, size(ExDe)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_ExAb_d, ExAb, size(ExAb)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_ExAr_d, ExAr, size(ExAr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EyFr_d, EyFr, size(EyFr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EyTr_d, EyTr, size(EyTr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EyAb_d, EyAb, size(EyAb)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EyAr_d, EyAr, size(EyAr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EzIz_d, EzIz, size(EzIz)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EzDe_d, EzDe, size(EzDe)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EzFr_d, EzFr, size(EzFr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_EzTr_d, EzTr, size(EzTr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxIz_d, HxIz, size(HxIz)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxDe_d, HxDe, size(HxDe)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxAb_d, HxAb, size(HxAb)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxAr_d, HxAr, size(HxAr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyFr_d, HyFr, size(HyFr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyTr_d, HyTr, size(HyTr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyAb_d, HyAb, size(HyAb)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyAr_d, HyAr, size(HyAr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzIz_d, HzIz, size(HzIz)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzDe_d, HzDe, size(HzDe)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzFr_d, HzFr, size(HzFr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzTr_d, HzTr, size(HzTr)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxIz2_d, HxIz2, size(HxIz2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxDe2_d, HxDe2, size(HxDe2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxAb2_d, HxAb2, size(HxAb2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HxAr2_d, HxAr2, size(HxAr2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyFr2_d, HyFr2, size(HyFr2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyTr2_d, HyTr2, size(HyTr2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyAb2_d, HyAb2, size(HyAb2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HyAr2_d, HyAr2, size(HyAr2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzIz2_d, HzIz2, size(HzIz2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzDe2_d, HzDe2, size(HzDe2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzFr2_d, HzFr2, size(HzFr2)*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_HzTr2_d, HzTr2, size(HzTr2)*8, cudaMemcpyHostToDevice)
+        endif
+
+        ! Allocate and copy frequency arrays
+        if (numFreqs > 0) then
+           allocate(this%nf2ff_expIwdt_d(0:numFreqs-1))
+           allocate(this%nf2ff_auxExp_E_d(0:numFreqs-1))
+           allocate(this%nf2ff_auxExp_H_d(0:numFreqs-1))
+           cuda_status = cudaMemcpy(this%nf2ff_expIwdt_d, expIwdt, numFreqs*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_auxExp_E_d, auxExp_E, numFreqs*8, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_auxExp_H_d, auxExp_H, numFreqs*8, cudaMemcpyHostToDevice)
+        endif
+
+        ! Allocate and copy geometry arrays (18 coordinate arrays)
+        if (numCells > 0) then
+           allocate(this%nf2ff_phys_x_Mx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_Mx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_Mx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_x_My_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_My_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_My_d(0:numCells-1))
+           allocate(this%nf2ff_phys_x_Mz_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_Mz_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_Mz_d(0:numCells-1))
+           allocate(this%nf2ff_phys_x_Jx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_Jx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_Jx_d(0:numCells-1))
+           allocate(this%nf2ff_phys_x_Jy_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_Jy_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_Jy_d(0:numCells-1))
+           allocate(this%nf2ff_phys_x_Jz_d(0:numCells-1))
+           allocate(this%nf2ff_phys_y_Jz_d(0:numCells-1))
+           allocate(this%nf2ff_phys_z_Jz_d(0:numCells-1))
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_Mx_d, phys_x_Mx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_Mx_d, phys_y_Mx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_Mx_d, phys_z_Mx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_My_d, phys_x_My, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_My_d, phys_y_My, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_My_d, phys_z_My, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_Mz_d, phys_x_Mz, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_Mz_d, phys_y_Mz, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_Mz_d, phys_z_Mz, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jx_d, phys_x_Jx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jx_d, phys_y_Jx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jx_d, phys_z_Jx, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jy_d, phys_x_Jy, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jy_d, phys_y_Jy, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jy_d, phys_y_Jy, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_x_Jz_d, phys_x_Jz, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_y_Jz_d, phys_y_Jz, numCells*4, cudaMemcpyHostToDevice)
+           cuda_status = cudaMemcpy(this%nf2ff_phys_z_Jz_d, phys_z_Jz, numCells*4, cudaMemcpyHostToDevice)
+        endif
+
+        ! Allocate and copy cell dimension arrays
+         if (numCells > 0) then
+            allocate(this%nf2ff_dyh_d(0:numCells-1))
+            allocate(this%nf2ff_dze_d(0:numCells-1))
+            allocate(this%nf2ff_dye_d(0:numCells-1))
+            allocate(this%nf2ff_dzh_d(0:numCells-1))
+            cuda_status = cudaMemcpy(this%nf2ff_dyh_d, dyh_in, numCells*4, cudaMemcpyHostToDevice)
+            cuda_status = cudaMemcpy(this%nf2ff_dze_d, dze_in, numCells*4, cudaMemcpyHostToDevice)
+            cuda_status = cudaMemcpy(this%nf2ff_dye_d, dye_in, numCells*4, cudaMemcpyHostToDevice)
+            cuda_status = cudaMemcpy(this%nf2ff_dzh_d, dzh_in, numCells*4, cudaMemcpyHostToDevice)
+         endif
+
+        ! Allocate output buffers
+        if (Ntheta > 0 .and. Nphi > 0 .and. numFreqs > 0) then
+           allocate(this%nf2ff_Etheta_d(0:numFreqs*Ntheta*Nphi-1))
+           allocate(this%nf2ff_Ephi_d(0:numFreqs*Ntheta*Nphi-1))
+           allocate(this%nf2ff_RCS_d(0:numFreqs*Ntheta*Nphi-1))
+        endif
+
+        ! Store configuration
+        this%nf2ff_num_cells = numCells
+        this%nf2ff_num_freqs = numFreqs
+        this%nf2ff_Ntheta = Ntheta
+        this%nf2ff_Nphi = Nphi
+        this%nf2ff_thetaStep = thetaStep
+        this%nf2ff_phiStep = phiStep
+        this%nf2ff_freqStep = freqStep
+        this%nf2ff_initialFreq = initialFreq
+        this%nf2ff_cluz = cluz
+        this%nf2ff_z0 = z0
+        this%nf2ff_XDobleAncho = XDobleAncho
+        this%nf2ff_YDobleAncho = YDobleAncho
+        this%nf2ff_ZDobleAncho = ZDobleAncho
+        this%nf2ff_sym_flags = sym_flags
+
+        this%nf2ff_initialized = .true.
+
+     end subroutine gpu_init_nf2ff_buffers
+
+     !--------------------------------------------------------------------------------
+     ! Destroy NF2FF device buffers
+     !--------------------------------------------------------------------------------
+     subroutine gpu_destroy_nf2ff_buffers(this)
+        class(gpu_state_t), intent(inout) :: this
+
+        if (.not. this%nf2ff_initialized) return
+
+        if (associated(this%nf2ff_ExIz_d)) deallocate(this%nf2ff_ExIz_d)
+        if (associated(this%nf2ff_ExDe_d)) deallocate(this%nf2ff_ExDe_d)
+        if (associated(this%nf2ff_ExAb_d)) deallocate(this%nf2ff_ExAb_d)
+        if (associated(this%nf2ff_ExAr_d)) deallocate(this%nf2ff_ExAr_d)
+        if (associated(this%nf2ff_EyFr_d)) deallocate(this%nf2ff_EyFr_d)
+        if (associated(this%nf2ff_EyTr_d)) deallocate(this%nf2ff_EyTr_d)
+        if (associated(this%nf2ff_EyAb_d)) deallocate(this%nf2ff_EyAb_d)
+        if (associated(this%nf2ff_EyAr_d)) deallocate(this%nf2ff_EyAr_d)
+        if (associated(this%nf2ff_EzIz_d)) deallocate(this%nf2ff_EzIz_d)
+        if (associated(this%nf2ff_EzDe_d)) deallocate(this%nf2ff_EzDe_d)
+        if (associated(this%nf2ff_EzFr_d)) deallocate(this%nf2ff_EzFr_d)
+        if (associated(this%nf2ff_EzTr_d)) deallocate(this%nf2ff_EzTr_d)
+        if (associated(this%nf2ff_HxIz_d)) deallocate(this%nf2ff_HxIz_d)
+        if (associated(this%nf2ff_HxDe_d)) deallocate(this%nf2ff_HxDe_d)
+        if (associated(this%nf2ff_HxAb_d)) deallocate(this%nf2ff_HxAb_d)
+        if (associated(this%nf2ff_HxAr_d)) deallocate(this%nf2ff_HxAr_d)
+        if (associated(this%nf2ff_HyFr_d)) deallocate(this%nf2ff_HyFr_d)
+        if (associated(this%nf2ff_HyTr_d)) deallocate(this%nf2ff_HyTr_d)
+        if (associated(this%nf2ff_HyAb_d)) deallocate(this%nf2ff_HyAb_d)
+        if (associated(this%nf2ff_HyAr_d)) deallocate(this%nf2ff_HyAr_d)
+        if (associated(this%nf2ff_HzIz_d)) deallocate(this%nf2ff_HzIz_d)
+        if (associated(this%nf2ff_HzDe_d)) deallocate(this%nf2ff_HzDe_d)
+        if (associated(this%nf2ff_HzFr_d)) deallocate(this%nf2ff_HzFr_d)
+        if (associated(this%nf2ff_HzTr_d)) deallocate(this%nf2ff_HzTr_d)
+        if (associated(this%nf2ff_HxIz2_d)) deallocate(this%nf2ff_HxIz2_d)
+        if (associated(this%nf2ff_HxDe2_d)) deallocate(this%nf2ff_HxDe2_d)
+        if (associated(this%nf2ff_HxAb2_d)) deallocate(this%nf2ff_HxAb2_d)
+        if (associated(this%nf2ff_HxAr2_d)) deallocate(this%nf2ff_HxAr2_d)
+        if (associated(this%nf2ff_HyFr2_d)) deallocate(this%nf2ff_HyFr2_d)
+        if (associated(this%nf2ff_HyTr2_d)) deallocate(this%nf2ff_HyTr2_d)
+        if (associated(this%nf2ff_HyAb2_d)) deallocate(this%nf2ff_HyAb2_d)
+        if (associated(this%nf2ff_HyAr2_d)) deallocate(this%nf2ff_HyAr2_d)
+        if (associated(this%nf2ff_HzIz2_d)) deallocate(this%nf2ff_HzIz2_d)
+        if (associated(this%nf2ff_HzDe2_d)) deallocate(this%nf2ff_HzDe2_d)
+        if (associated(this%nf2ff_HzFr2_d)) deallocate(this%nf2ff_HzFr2_d)
+        if (associated(this%nf2ff_HzTr2_d)) deallocate(this%nf2ff_HzTr2_d)
+
+        if (associated(this%nf2ff_expIwdt_d)) deallocate(this%nf2ff_expIwdt_d)
+        if (associated(this%nf2ff_auxExp_E_d)) deallocate(this%nf2ff_auxExp_E_d)
+        if (associated(this%nf2ff_auxExp_H_d)) deallocate(this%nf2ff_auxExp_H_d)
+
+        if (associated(this%nf2ff_phys_x_Mx_d)) deallocate(this%nf2ff_phys_x_Mx_d)
+        if (associated(this%nf2ff_phys_y_Mx_d)) deallocate(this%nf2ff_phys_y_Mx_d)
+        if (associated(this%nf2ff_phys_z_Mx_d)) deallocate(this%nf2ff_phys_z_Mx_d)
+        if (associated(this%nf2ff_phys_x_My_d)) deallocate(this%nf2ff_phys_x_My_d)
+        if (associated(this%nf2ff_phys_y_My_d)) deallocate(this%nf2ff_phys_y_My_d)
+        if (associated(this%nf2ff_phys_z_My_d)) deallocate(this%nf2ff_phys_z_My_d)
+        if (associated(this%nf2ff_phys_x_Mz_d)) deallocate(this%nf2ff_phys_x_Mz_d)
+        if (associated(this%nf2ff_phys_y_Mz_d)) deallocate(this%nf2ff_phys_y_Mz_d)
+        if (associated(this%nf2ff_phys_z_Mz_d)) deallocate(this%nf2ff_phys_z_Mz_d)
+        if (associated(this%nf2ff_phys_x_Jx_d)) deallocate(this%nf2ff_phys_x_Jx_d)
+        if (associated(this%nf2ff_phys_y_Jx_d)) deallocate(this%nf2ff_phys_y_Jx_d)
+        if (associated(this%nf2ff_phys_z_Jx_d)) deallocate(this%nf2ff_phys_z_Jx_d)
+        if (associated(this%nf2ff_phys_x_Jy_d)) deallocate(this%nf2ff_phys_x_Jy_d)
+        if (associated(this%nf2ff_phys_y_Jy_d)) deallocate(this%nf2ff_phys_y_Jy_d)
+        if (associated(this%nf2ff_phys_z_Jy_d)) deallocate(this%nf2ff_phys_z_Jy_d)
+        if (associated(this%nf2ff_phys_x_Jz_d)) deallocate(this%nf2ff_phys_x_Jz_d)
+        if (associated(this%nf2ff_phys_y_Jz_d)) deallocate(this%nf2ff_phys_y_Jz_d)
+        if (associated(this%nf2ff_phys_z_Jz_d)) deallocate(this%nf2ff_phys_z_Jz_d)
+
+        if (associated(this%nf2ff_dyh_d)) deallocate(this%nf2ff_dyh_d)
+        if (associated(this%nf2ff_dze_d)) deallocate(this%nf2ff_dze_d)
+        if (associated(this%nf2ff_dye_d)) deallocate(this%nf2ff_dye_d)
+        if (associated(this%nf2ff_dzh_d)) deallocate(this%nf2ff_dzh_d)
+
+        if (associated(this%nf2ff_Etheta_d)) deallocate(this%nf2ff_Etheta_d)
+        if (associated(this%nf2ff_Ephi_d)) deallocate(this%nf2ff_Ephi_d)
+        if (associated(this%nf2ff_RCS_d)) deallocate(this%nf2ff_RCS_d)
+
+        this%nf2ff_initialized = .false.
+
+     end subroutine gpu_destroy_nf2ff_buffers
+
+  end module gpu_core_probe_m
diff --git a/src_main_pub/gpu_cpml_m.F90 b/src_main_pub/gpu_cpml_m.F90
new file mode 100644
index 000000000..718e23ab5
--- /dev/null
+++ b/src_main_pub/gpu_cpml_m.F90
@@ -0,0 +1,903 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU CPML KERNELS MODULE - CUDA Fortran (CUF) accelerated CPML kernels
+!  Left + Right boundary CPML with persistent psi state on device.
+!  Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_cpml_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! CPML Left Boundary - GPU accelerated kernels
+   ! Uses persistent device g2_d/gm2_d arrays (already on device from gpu_init)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_left(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_left_initialized) return
+
+      ! Advance Ex on left boundary
+      call gpu_advanceCPML_Ex_left_kernel(this%Ex_d, this%Hz_d, this%sggMiEx_d, &
+                                          this%pml_psi_Exy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, &
+                                          this%g2_d, &
+                                          this%pml_left_Ex_ii, this%pml_left_Ex_ij, &
+                                          this%pml_left_Ex_ji, this%pml_left_Ex_jj, &
+                                          this%pml_left_Ex_ki, this%pml_left_Ex_kj, &
+                                          b%Ex%XI-1)
+
+      ! Advance Ez on left boundary
+      call gpu_advanceCPML_Ez_left_kernel(this%Ez_d, this%Hx_d, this%sggMiEz_d, &
+                                          this%pml_psi_Ezy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, &
+                                          this%g2_d, &
+                                          this%pml_left_Ez_ii, this%pml_left_Ez_ij, &
+                                          this%pml_left_Ez_ji, this%pml_left_Ez_jj, &
+                                          this%pml_left_Ez_ki, this%pml_left_Ez_kj, &
+                                          b%Ez%XI-1)
+
+   end subroutine gpu_advanceCPML_E_left
+
+   subroutine gpu_advanceCPML_Ex_left_kernel(Ex_d, Hz_d, sggMiEx_d, psi_Exy_d, &
+                                              P_be_y_d, P_ce_y_d, g2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hz_d, psi_Exy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d
+      real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEx_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hz_d(i,j,k) - Hz_d(i,j-1,k)) * P_ce_y_d(j)
+              Ex_d(i,j,k) = Ex_d(i,j,k) + g2_d(medio) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ex_left_kernel
+
+   subroutine gpu_advanceCPML_Ez_left_kernel(Ez_d, Hx_d, sggMiEz_d, psi_Ezy_d, &
+                                              P_be_y_d, P_ce_y_d, g2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, psi_Ezy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d
+      real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hx_d(i,j,k) - Hx_d(i,j-1,k)) * P_ce_y_d(j)
+              Ez_d(i,j,k) = Ez_d(i,j,k) - g2_d(medio) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ez_left_kernel
+
+   subroutine gpu_advanceCPML_H_left(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_left_initialized) return
+
+      ! Advance Hx on left boundary
+      call gpu_advanceCPML_Hx_left_kernel(this%Hx_d, this%Ez_d, this%sggMiHx_d, &
+                                          this%pml_psi_Hxy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, &
+                                          this%gm2_d, &
+                                          this%pml_left_Hx_ii, this%pml_left_Hx_ij, &
+                                          this%pml_left_Hx_ji, this%pml_left_Hx_jj, &
+                                          this%pml_left_Hx_ki, this%pml_left_Hx_kj, &
+                                          b%Hx%XI-1)
+
+      ! Advance Hz on left boundary
+      call gpu_advanceCPML_Hz_left_kernel(this%Hz_d, this%Ex_d, this%sggMiHz_d, &
+                                          this%pml_psi_Hzy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, &
+                                          this%gm2_d, &
+                                          this%pml_left_Hz_ii, this%pml_left_Hz_ij, &
+                                          this%pml_left_Hz_ji, this%pml_left_Hz_jj, &
+                                          this%pml_left_Hz_ki, this%pml_left_Hz_kj, &
+                                          b%Hz%XI-1)
+
+   end subroutine gpu_advanceCPML_H_left
+
+   subroutine gpu_advanceCPML_Hx_left_kernel(Hx_d, Ez_d, sggMiHx_d, psi_Hxy_d, &
+                                             P_bm_y_d, P_cm_y_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ez_d, psi_Hxy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHx_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ez_d(i,j+1,k) - Ez_d(i,j,k)) * P_cm_y_d(j)
+              Hx_d(i,j,k) = Hx_d(i,j,k) - gm2_d(medio) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hx_left_kernel
+
+   subroutine gpu_advanceCPML_Hz_left_kernel(Hz_d, Ex_d, sggMiHz_d, psi_Hzy_d, &
+                                             P_bm_y_d, P_cm_y_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, psi_Hzy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ex_d(i,j+1,k) - Ex_d(i,j,k)) * P_cm_y_d(j)
+              Hz_d(i,j,k) = Hz_d(i,j,k) + gm2_d(medio) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hz_left_kernel
+
+   !--------------------------------------------------------------------------------
+   ! CPML Right boundary - wrapper subroutines
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_right(this, b, numMedia)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+      integer, intent(in) :: numMedia
+      if (.not. this%pml_right_initialized) return
+      call gpu_advanceCPML_Ex_right_kernel(this%Ex_d, this%Hz_d, this%sggMiEx_d, &
+                                          this%pml_psi_Exy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, &
+                                          this%g2_d, numMedia, &
+                                          this%pml_right_Ex_ii, this%pml_right_Ex_ij, &
+                                          this%pml_right_Ex_ji, this%pml_right_Ex_jj, &
+                                          this%pml_right_Ex_ki, this%pml_right_Ex_kj, &
+                                          b%Ex%XI-1)
+      call gpu_advanceCPML_Ez_right_kernel(this%Ez_d, this%Hx_d, this%sggMiEz_d, &
+                                          this%pml_psi_Ezy_left, this%pml_P_be_y_left, this%pml_P_ce_y_left, &
+                                          this%g2_d, numMedia, &
+                                          this%pml_right_Ez_ii, this%pml_right_Ez_ij, &
+                                          this%pml_right_Ez_ji, this%pml_right_Ez_jj, &
+                                          this%pml_right_Ez_ki, this%pml_right_Ez_kj, &
+                                          b%Ez%XI-1)
+   end subroutine gpu_advanceCPML_E_right
+
+   subroutine gpu_advanceCPML_H_right(this, b, numMedia)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+      integer, intent(in) :: numMedia
+      if (.not. this%pml_right_initialized) return
+      call gpu_advanceCPML_Hx_right_kernel(this%Hx_d, this%Ez_d, this%sggMiHx_d, &
+                                          this%pml_psi_Hxy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, &
+                                          this%gm2_d, numMedia, &
+                                          this%pml_right_Hx_ii, this%pml_right_Hx_ij, &
+                                          this%pml_right_Hx_ji, this%pml_right_Hx_jj, &
+                                          this%pml_right_Hx_ki, this%pml_right_Hx_kj, &
+                                          b%Hx%XI-1)
+      call gpu_advanceCPML_Hz_right_kernel(this%Hz_d, this%Ex_d, this%sggMiHz_d, &
+                                          this%pml_psi_Hzy_left, this%pml_P_bm_y_left, this%pml_P_cm_y_left, &
+                                          this%gm2_d, numMedia, &
+                                          this%pml_right_Hz_ii, this%pml_right_Hz_ij, &
+                                          this%pml_right_Hz_ji, this%pml_right_Hz_jj, &
+                                          this%pml_right_Hz_ki, this%pml_right_Hz_kj, &
+                                          b%Hz%XI-1)
+   end subroutine gpu_advanceCPML_H_right
+
+   subroutine gpu_advanceCPML_Ex_right_kernel(Ex_d, Hz_d, sggMiEx_d, psi_Exy_d, &
+                                              P_be_y_d, P_ce_y_d, g2_d, numMedia, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      integer, intent(in) :: numMedia
+      real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hz_d, psi_Exy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d
+      real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEx_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hz_d(i,j,k) - Hz_d(i,j-1,k)) * P_ce_y_d(j)
+              Ex_d(i,j,k) = Ex_d(i,j,k) + g2_d(medio) * psi_Exy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ex_right_kernel
+
+   subroutine gpu_advanceCPML_Ez_right_kernel(Ez_d, Hx_d, sggMiEz_d, psi_Ezy_d, &
+                                              P_be_y_d, P_ce_y_d, g2_d, numMedia, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      integer, intent(in) :: numMedia
+      real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, psi_Ezy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d
+      real(kind=rkind), device, dimension(:) :: P_be_y_d, P_ce_y_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) = P_be_y_d(j) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hx_d(i,j,k) - Hx_d(i,j-1,k)) * P_ce_y_d(j)
+              Ez_d(i,j,k) = Ez_d(i,j,k) - g2_d(medio) * psi_Ezy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ez_right_kernel
+
+   subroutine gpu_advanceCPML_Hx_right_kernel(Hx_d, Ez_d, sggMiHx_d, psi_Hxy_d, &
+                                              P_bm_y_d, P_cm_y_d, gm2_d, numMedia, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      integer, intent(in) :: numMedia
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ez_d, psi_Hxy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHx_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ez_d(i,j+1,k) - Ez_d(i,j,k)) * P_cm_y_d(j)
+              Hx_d(i,j,k) = Hx_d(i,j,k) - gm2_d(medio) * psi_Hxy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hx_right_kernel
+
+   subroutine gpu_advanceCPML_Hz_right_kernel(Hz_d, Ex_d, sggMiHz_d, psi_Hzy_d, &
+                                              P_bm_y_d, P_cm_y_d, gm2_d, numMedia, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      integer, intent(in) :: numMedia
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, psi_Hzy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: P_bm_y_d, P_cm_y_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_y_d(j) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ex_d(i,j+1,k) - Ex_d(i,j,k)) * P_cm_y_d(j)
+              Hz_d(i,j,k) = Hz_d(i,j,k) + gm2_d(medio) * psi_Hzy_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hz_right_kernel
+
+   !--------------------------------------------------------------------------------
+   ! CPML Down Boundary - GPU accelerated kernels (z-dependent coefficients)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_down(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_down_initialized) return
+
+      call gpu_advanceCPML_Ey_down_kernel(this%Ey_d, this%Hx_d, this%sggMiEy_d, &
+                                          this%pml_psi_Eyz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, &
+                                          this%g2_d, &
+                                          this%pml_down_Ey_ii, this%pml_down_Ey_ij, &
+                                          this%pml_down_Ey_ji, this%pml_down_Ey_jj, &
+                                          this%pml_down_Ey_ki, this%pml_down_Ey_kj, &
+                                          b%Ey%ZI-1)
+
+      call gpu_advanceCPML_Ex_down_kernel(this%Ex_d, this%Hy_d, this%sggMiEx_d, &
+                                          this%pml_psi_Exz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, &
+                                          this%g2_d, &
+                                          this%pml_down_Ex_ii, this%pml_down_Ex_ij, &
+                                          this%pml_down_Ex_ji, this%pml_down_Ex_jj, &
+                                          this%pml_down_Ex_ki, this%pml_down_Ex_kj, &
+                                          b%Ex%ZI-1)
+
+   end subroutine gpu_advanceCPML_E_down
+
+   subroutine gpu_advanceCPML_Ey_down_kernel(Ey_d, Hx_d, sggMiEy_d, psi_Eyz_d, &
+                                             P_be_z_d, P_ce_z_d, g2_d, &
+                                             ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hx_d, psi_Eyz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d
+      real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEy_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hx_d(i,j,k) - Hx_d(i,j,k-1)) * P_ce_z_d(k)
+              Ey_d(i,j,k) = Ey_d(i,j,k) + g2_d(medio) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ey_down_kernel
+
+   subroutine gpu_advanceCPML_Ex_down_kernel(Ex_d, Hy_d, sggMiEx_d, psi_Exz_d, &
+                                             P_be_z_d, P_ce_z_d, g2_d, &
+                                             ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, psi_Exz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d
+      real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEx_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i,j,k-1)) * P_ce_z_d(k)
+              Ex_d(i,j,k) = Ex_d(i,j,k) - g2_d(medio) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ex_down_kernel
+
+   subroutine gpu_advanceCPML_H_down(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_down_initialized) return
+
+      call gpu_advanceCPML_Hy_down_kernel(this%Hy_d, this%Ex_d, this%sggMiHy_d, &
+                                          this%pml_psi_Hyz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, &
+                                          this%gm2_d, &
+                                          this%pml_down_Hy_ii, this%pml_down_Hy_ij, &
+                                          this%pml_down_Hy_ji, this%pml_down_Hy_jj, &
+                                          this%pml_down_Hy_ki, this%pml_down_Hy_kj, &
+                                          b%Hy%ZI-1)
+
+      call gpu_advanceCPML_Hx_down_kernel(this%Hx_d, this%Ey_d, this%sggMiHx_d, &
+                                          this%pml_psi_Hxz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, &
+                                          this%gm2_d, &
+                                          this%pml_down_Hx_ii, this%pml_down_Hx_ij, &
+                                          this%pml_down_Hx_ji, this%pml_down_Hx_jj, &
+                                          this%pml_down_Hx_ki, this%pml_down_Hx_kj, &
+                                          b%Hx%ZI-1)
+
+   end subroutine gpu_advanceCPML_H_down
+
+   subroutine gpu_advanceCPML_Hy_down_kernel(Hy_d, Ex_d, sggMiHy_d, psi_Hyz_d, &
+                                             P_bm_z_d, P_cm_z_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ex_d, psi_Hyz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHy_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ex_d(i,j,k+1) - Ex_d(i,j,k)) * P_cm_z_d(k)
+              Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hy_down_kernel
+
+   subroutine gpu_advanceCPML_Hx_down_kernel(Hx_d, Ey_d, sggMiHx_d, psi_Hxz_d, &
+                                             P_bm_z_d, P_cm_z_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, psi_Hxz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHx_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ey_d(i,j,k+1) - Ey_d(i,j,k)) * P_cm_z_d(k)
+              Hx_d(i,j,k) = Hx_d(i,j,k) + gm2_d(medio) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hx_down_kernel
+
+   !--------------------------------------------------------------------------------
+   ! CPML Up Boundary - GPU accelerated kernels (z-dependent, same as down)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_up(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_up_initialized) return
+
+      call gpu_advanceCPML_Ey_up_kernel(this%Ey_d, this%Hx_d, this%sggMiEy_d, &
+                                        this%pml_psi_Eyz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, &
+                                        this%g2_d, &
+                                        this%pml_up_Ey_ii, this%pml_up_Ey_ij, &
+                                        this%pml_up_Ey_ji, this%pml_up_Ey_jj, &
+                                        this%pml_up_Ey_ki, this%pml_up_Ey_kj, &
+                                        b%Ey%ZI-1)
+
+      call gpu_advanceCPML_Ex_up_kernel(this%Ex_d, this%Hy_d, this%sggMiEx_d, &
+                                        this%pml_psi_Exz_down, this%pml_P_be_z_down, this%pml_P_ce_z_down, &
+                                        this%g2_d, &
+                                        this%pml_up_Ex_ii, this%pml_up_Ex_ij, &
+                                        this%pml_up_Ex_ji, this%pml_up_Ex_jj, &
+                                        this%pml_up_Ex_ki, this%pml_up_Ex_kj, &
+                                        b%Ex%ZI-1)
+
+   end subroutine gpu_advanceCPML_E_up
+
+   subroutine gpu_advanceCPML_Ey_up_kernel(Ey_d, Hx_d, sggMiEy_d, psi_Eyz_d, &
+                                           P_be_z_d, P_ce_z_d, g2_d, &
+                                           ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hx_d, psi_Eyz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d
+      real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEy_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hx_d(i,j,k) - Hx_d(i,j,k-1)) * P_ce_z_d(k)
+              Ey_d(i,j,k) = Ey_d(i,j,k) + g2_d(medio) * psi_Eyz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ey_up_kernel
+
+   subroutine gpu_advanceCPML_Ex_up_kernel(Ex_d, Hy_d, sggMiEx_d, psi_Exz_d, &
+                                           P_be_z_d, P_ce_z_d, g2_d, &
+                                           ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, psi_Exz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d
+      real(kind=rkind), device, dimension(:) :: P_be_z_d, P_ce_z_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEx_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) = P_be_z_d(k) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i,j,k-1)) * P_ce_z_d(k)
+              Ex_d(i,j,k) = Ex_d(i,j,k) - g2_d(medio) * psi_Exz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ex_up_kernel
+
+   subroutine gpu_advanceCPML_H_up(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_up_initialized) return
+
+      call gpu_advanceCPML_Hy_up_kernel(this%Hy_d, this%Ex_d, this%sggMiHy_d, &
+                                        this%pml_psi_Hyz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, &
+                                        this%gm2_d, &
+                                        this%pml_up_Hy_ii, this%pml_up_Hy_ij, &
+                                        this%pml_up_Hy_ji, this%pml_up_Hy_jj, &
+                                        this%pml_up_Hy_ki, this%pml_up_Hy_kj, &
+                                        b%Hy%ZI-1)
+
+      call gpu_advanceCPML_Hx_up_kernel(this%Hx_d, this%Ey_d, this%sggMiHx_d, &
+                                        this%pml_psi_Hxz_down, this%pml_P_bm_z_down, this%pml_P_cm_z_down, &
+                                        this%gm2_d, &
+                                        this%pml_up_Hx_ii, this%pml_up_Hx_ij, &
+                                        this%pml_up_Hx_ji, this%pml_up_Hx_jj, &
+                                        this%pml_up_Hx_ki, this%pml_up_Hx_kj, &
+                                        b%Hx%ZI-1)
+
+   end subroutine gpu_advanceCPML_H_up
+
+   subroutine gpu_advanceCPML_Hy_up_kernel(Hy_d, Ex_d, sggMiHy_d, psi_Hyz_d, &
+                                           P_bm_z_d, P_cm_z_d, gm2_d, &
+                                           ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ex_d, psi_Hyz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHy_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ex_d(i,j,k+1) - Ex_d(i,j,k)) * P_cm_z_d(k)
+              Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hy_up_kernel
+
+   subroutine gpu_advanceCPML_Hx_up_kernel(Hx_d, Ey_d, sggMiHx_d, psi_Hxz_d, &
+                                           P_bm_z_d, P_cm_z_d, gm2_d, &
+                                           ii, ij, ji, jj, ki, kj, zi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, zi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, psi_Hxz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: P_bm_z_d, P_cm_z_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHx_d(i-zi_offset,j-zi_offset,k-zi_offset)
+              psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_z_d(k) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ey_d(i,j,k+1) - Ey_d(i,j,k)) * P_cm_z_d(k)
+              Hx_d(i,j,k) = Hx_d(i,j,k) + gm2_d(medio) * psi_Hxz_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hx_up_kernel
+
+   !--------------------------------------------------------------------------------
+   ! CPML Back Boundary - GPU accelerated kernels (x-dependent coefficients)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_back(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_back_initialized) return
+
+      call gpu_advanceCPML_Ez_back_kernel(this%Ez_d, this%Hy_d, this%sggMiEz_d, &
+                                          this%pml_psi_Ezx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, &
+                                          this%g2_d, &
+                                          this%pml_back_Ez_ii, this%pml_back_Ez_ij, &
+                                          this%pml_back_Ez_ji, this%pml_back_Ez_jj, &
+                                          this%pml_back_Ez_ki, this%pml_back_Ez_kj, &
+                                          b%Ez%XI-1)
+
+      call gpu_advanceCPML_Ey_back_kernel(this%Ey_d, this%Hz_d, this%sggMiEy_d, &
+                                          this%pml_psi_Eyx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, &
+                                          this%g2_d, &
+                                          this%pml_back_Ey_ii, this%pml_back_Ey_ij, &
+                                          this%pml_back_Ey_ji, this%pml_back_Ey_jj, &
+                                          this%pml_back_Ey_ki, this%pml_back_Ey_kj, &
+                                          b%Ey%XI-1)
+
+   end subroutine gpu_advanceCPML_E_back
+
+   subroutine gpu_advanceCPML_Ez_back_kernel(Ez_d, Hy_d, sggMiEz_d, psi_Ezx_d, &
+                                             P_be_x_d, P_ce_x_d, g2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hy_d, psi_Ezx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d
+      real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_ce_x_d(i)
+              Ez_d(i,j,k) = Ez_d(i,j,k) + g2_d(medio) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ez_back_kernel
+
+   subroutine gpu_advanceCPML_Ey_back_kernel(Ey_d, Hz_d, sggMiEy_d, psi_Eyx_d, &
+                                             P_be_x_d, P_ce_x_d, g2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, psi_Eyx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d
+      real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEy_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hz_d(i,j,k) - Hz_d(i-1,j,k)) * P_ce_x_d(i)
+              Ey_d(i,j,k) = Ey_d(i,j,k) - g2_d(medio) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ey_back_kernel
+
+   subroutine gpu_advanceCPML_H_back(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_back_initialized) return
+
+      call gpu_advanceCPML_Hz_back_kernel(this%Hz_d, this%Hy_d, this%sggMiHz_d, &
+                                          this%pml_psi_Hzx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, &
+                                          this%gm2_d, &
+                                          this%pml_back_Hz_ii, this%pml_back_Hz_ij, &
+                                          this%pml_back_Hz_ji, this%pml_back_Hz_jj, &
+                                          this%pml_back_Hz_ki, this%pml_back_Hz_kj, &
+                                          b%Hz%XI-1)
+
+      call gpu_advanceCPML_Hy_back_kernel(this%Hy_d, this%Ez_d, this%sggMiHy_d, &
+                                          this%pml_psi_Hyx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, &
+                                          this%gm2_d, &
+                                          this%pml_back_Hy_ii, this%pml_back_Hy_ij, &
+                                          this%pml_back_Hy_ji, this%pml_back_Hy_jj, &
+                                          this%pml_back_Hy_ki, this%pml_back_Hy_kj, &
+                                          b%Hy%XI-1)
+
+   end subroutine gpu_advanceCPML_H_back
+
+   subroutine gpu_advanceCPML_Hz_back_kernel(Hz_d, Hy_d, sggMiHz_d, psi_Hzx_d, &
+                                             P_bm_x_d, P_cm_x_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Hy_d, psi_Hzx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_cm_x_d(i)
+              Hz_d(i,j,k) = Hz_d(i,j,k) - gm2_d(medio) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hz_back_kernel
+
+   subroutine gpu_advanceCPML_Hy_back_kernel(Hy_d, Ez_d, sggMiHy_d, psi_Hyx_d, &
+                                             P_bm_x_d, P_cm_x_d, gm2_d, &
+                                             ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, psi_Hyx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHy_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ez_d(i,j,k) - Ez_d(i-1,j,k)) * P_cm_x_d(i)
+              Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hy_back_kernel
+
+   !--------------------------------------------------------------------------------
+   ! CPML Front Boundary - GPU accelerated kernels (x-dependent, same as back)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceCPML_E_front(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_front_initialized) return
+
+      call gpu_advanceCPML_Ez_front_kernel(this%Ez_d, this%Hy_d, this%sggMiEz_d, &
+                                           this%pml_psi_Ezx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, &
+                                           this%g2_d, &
+                                           this%pml_front_Ez_ii, this%pml_front_Ez_ij, &
+                                           this%pml_front_Ez_ji, this%pml_front_Ez_jj, &
+                                           this%pml_front_Ez_ki, this%pml_front_Ez_kj, &
+                                           b%Ez%XI-1)
+
+      call gpu_advanceCPML_Ey_front_kernel(this%Ey_d, this%Hz_d, this%sggMiEy_d, &
+                                           this%pml_psi_Eyx_back, this%pml_P_be_x_back, this%pml_P_ce_x_back, &
+                                           this%g2_d, &
+                                           this%pml_front_Ey_ii, this%pml_front_Ey_ij, &
+                                           this%pml_front_Ey_ji, this%pml_front_Ey_jj, &
+                                           this%pml_front_Ey_ki, this%pml_front_Ey_kj, &
+                                           b%Ey%XI-1)
+
+   end subroutine gpu_advanceCPML_E_front
+
+   subroutine gpu_advanceCPML_Ez_front_kernel(Ez_d, Hy_d, sggMiEz_d, psi_Ezx_d, &
+                                              P_be_x_d, P_ce_x_d, g2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hy_d, psi_Ezx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d
+      real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_ce_x_d(i)
+              Ez_d(i,j,k) = Ez_d(i,j,k) + g2_d(medio) * psi_Ezx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ez_front_kernel
+
+   subroutine gpu_advanceCPML_Ey_front_kernel(Ey_d, Hz_d, sggMiEy_d, psi_Eyx_d, &
+                                              P_be_x_d, P_ce_x_d, g2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, psi_Eyx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d
+      real(kind=rkind), device, dimension(:) :: P_be_x_d, P_ce_x_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiEy_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) = P_be_x_d(i) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hz_d(i,j,k) - Hz_d(i-1,j,k)) * P_ce_x_d(i)
+              Ey_d(i,j,k) = Ey_d(i,j,k) - g2_d(medio) * psi_Eyx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Ey_front_kernel
+
+   subroutine gpu_advanceCPML_H_front(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%pml_front_initialized) return
+
+      call gpu_advanceCPML_Hz_front_kernel(this%Hz_d, this%Hy_d, this%sggMiHz_d, &
+                                           this%pml_psi_Hzx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, &
+                                           this%gm2_d, &
+                                           this%pml_front_Hz_ii, this%pml_front_Hz_ij, &
+                                           this%pml_front_Hz_ji, this%pml_front_Hz_jj, &
+                                           this%pml_front_Hz_ki, this%pml_front_Hz_kj, &
+                                           b%Hz%XI-1)
+
+      call gpu_advanceCPML_Hy_front_kernel(this%Hy_d, this%Ez_d, this%sggMiHy_d, &
+                                           this%pml_psi_Hyx_back, this%pml_P_bm_x_back, this%pml_P_cm_x_back, &
+                                           this%gm2_d, &
+                                           this%pml_front_Hy_ii, this%pml_front_Hy_ij, &
+                                           this%pml_front_Hy_ji, this%pml_front_Hy_jj, &
+                                           this%pml_front_Hy_ki, this%pml_front_Hy_kj, &
+                                           b%Hy%XI-1)
+
+   end subroutine gpu_advanceCPML_H_front
+
+   subroutine gpu_advanceCPML_Hz_front_kernel(Hz_d, Hy_d, sggMiHz_d, psi_Hzx_d, &
+                                              P_bm_x_d, P_cm_x_d, gm2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Hy_d, psi_Hzx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHz_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Hy_d(i,j,k) - Hy_d(i-1,j,k)) * P_cm_x_d(i)
+              Hz_d(i,j,k) = Hz_d(i,j,k) - gm2_d(medio) * psi_Hzx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hz_front_kernel
+
+   subroutine gpu_advanceCPML_Hy_front_kernel(Hy_d, Ez_d, sggMiHy_d, psi_Hyx_d, &
+                                              P_bm_x_d, P_cm_x_d, gm2_d, &
+                                              ii, ij, ji, jj, ki, kj, xi_offset)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj, xi_offset
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, psi_Hyx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: P_bm_x_d, P_cm_x_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=ki,kj
+         do j=ji,jj
+            do i=ii,ij
+              medio = sggMiHy_d(i-xi_offset,j-xi_offset,k-xi_offset)
+              psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) = P_bm_x_d(i) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1) + &
+                 (Ez_d(i,j,k) - Ez_d(i-1,j,k)) * P_cm_x_d(i)
+              Hy_d(i,j,k) = Hy_d(i,j,k) - gm2_d(medio) * psi_Hyx_d(i-ii+1,j-ji+1,k-ki+1)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceCPML_Hy_front_kernel
+
+end module gpu_cpml_m
diff --git a/src_main_pub/gpu_mur_m.F90 b/src_main_pub/gpu_mur_m.F90
new file mode 100644
index 000000000..30285ca3c
--- /dev/null
+++ b/src_main_pub/gpu_mur_m.F90
@@ -0,0 +1,1115 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU MUR KERNELS MODULE - CUDA Fortran (CUF) accelerated MUR boundaries
+!  First-order Mur ABC for all 6 boundaries with persistent state on device.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_mur_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! MUR Left boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_left(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hx_left_kernel(this%Hx_d, this%sggMiHx_d, &
+                                         this%mur_past_Hx_left, this%mur_left_CAB1, &
+                                         this%mur_left_Hx_ii, this%mur_left_Hx_ij, &
+                                         this%mur_left_Hx_ji, this%mur_left_Hx_jj, &
+                                         this%mur_left_Hx_ki, this%mur_left_Hx_kj)
+
+      call gpu_advanceMUR_Hz_left_kernel(this%Hz_d, this%sggMiHz_d, &
+                                         this%mur_past_Hz_left, this%mur_left_CAB1, &
+                                         this%mur_left_Hz_ii, this%mur_left_Hz_ij, &
+                                         this%mur_left_Hz_ji, this%mur_left_Hz_jj, &
+                                         this%mur_left_Hz_ki, this%mur_left_Hz_kj)
+
+   end subroutine gpu_advanceMUR_H_left
+
+   subroutine gpu_advanceMUR_Hx_left_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do k=ki,kj
+         do i=ii,ij
+            medio = sggMiHx_d(i, ji+1, k)
+            Hx_d(i, ji, k) = past_Hx_d(i, ji+1, k) + &
+                             CAB1_d(medio) * (Hx_d(i, ji+1, k) - past_Hx_d(i, ji, k))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hx_left_kernel
+
+   subroutine gpu_advanceMUR_Hz_left_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do k=ki,kj
+         do i=ii,ij
+            medio = sggMiHz_d(i, ji+1, k)
+            Hz_d(i, ji, k) = past_Hz_d(i, ji+1, k) + &
+                             CAB1_d(medio) * (Hz_d(i, ji+1, k) - past_Hz_d(i, ji, k))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hz_left_kernel
+
+   !--------------------------------------------------------------------------------
+   ! MUR Right boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_right(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hx_right_kernel(this%Hx_d, this%sggMiHx_d, &
+                                          this%mur_past_Hx_right, this%mur_right_CAB1, &
+                                          this%mur_right_Hx_ii, this%mur_right_Hx_ij, &
+                                          this%mur_right_Hx_ji, this%mur_right_Hx_jj, &
+                                          this%mur_right_Hx_ki, this%mur_right_Hx_kj)
+
+      call gpu_advanceMUR_Hz_right_kernel(this%Hz_d, this%sggMiHz_d, &
+                                          this%mur_past_Hz_right, this%mur_right_CAB1, &
+                                          this%mur_right_Hz_ii, this%mur_right_Hz_ij, &
+                                          this%mur_right_Hz_ji, this%mur_right_Hz_jj, &
+                                          this%mur_right_Hz_ki, this%mur_right_Hz_kj)
+
+   end subroutine gpu_advanceMUR_H_right
+
+   subroutine gpu_advanceMUR_Hx_right_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, &
+                                             ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do k=ki,kj
+         do i=ii,ij
+            medio = sggMiHx_d(i, ji-1, k)
+            Hx_d(i, ji, k) = past_Hx_d(i, ji-1, k) + &
+                             CAB1_d(medio) * (Hx_d(i, ji-1, k) - past_Hx_d(i, ji, k))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hx_right_kernel
+
+   subroutine gpu_advanceMUR_Hz_right_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, &
+                                             ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, k
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do k=ki,kj
+         do i=ii,ij
+            medio = sggMiHz_d(i, ji-1, k)
+            Hz_d(i, ji, k) = past_Hz_d(i, ji-1, k) + &
+                             CAB1_d(medio) * (Hz_d(i, ji-1, k) - past_Hz_d(i, ji, k))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hz_right_kernel
+
+   !--------------------------------------------------------------------------------
+   ! MUR Down boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_down(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hy_down_kernel(this%Hy_d, this%sggMiHy_d, &
+                                         this%mur_past_Hy_down, this%mur_down_CAB1, &
+                                         this%mur_down_Hy_ii, this%mur_down_Hy_ij, &
+                                         this%mur_down_Hy_ji, this%mur_down_Hy_jj, &
+                                         this%mur_down_Hy_ki, this%mur_down_Hy_kj)
+
+      call gpu_advanceMUR_Hx_down_kernel(this%Hx_d, this%sggMiHx_d, &
+                                         this%mur_past_Hx_down, this%mur_down_CAB1, &
+                                         this%mur_down_Hx_ii, this%mur_down_Hx_ij, &
+                                         this%mur_down_Hx_ji, this%mur_down_Hx_jj, &
+                                         this%mur_down_Hx_ki, this%mur_down_Hx_kj)
+
+   end subroutine gpu_advanceMUR_H_down
+
+   subroutine gpu_advanceMUR_Hy_down_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHy_d(i, j, ki+1)
+            Hy_d(i, j, ki) = past_Hy_d(i, j, ki+1) + &
+                             CAB1_d(medio) * (Hy_d(i, j, ki+1) - past_Hy_d(i, j, ki))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hy_down_kernel
+
+   subroutine gpu_advanceMUR_Hx_down_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHx_d(i, j, ki+1)
+            Hx_d(i, j, ki) = past_Hx_d(i, j, ki+1) + &
+                             CAB1_d(medio) * (Hx_d(i, j, ki+1) - past_Hx_d(i, j, ki))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hx_down_kernel
+
+   !--------------------------------------------------------------------------------
+   ! MUR Up boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_up(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hy_up_kernel(this%Hy_d, this%sggMiHy_d, &
+                                       this%mur_past_Hy_up, this%mur_up_CAB1, &
+                                       this%mur_up_Hy_ii, this%mur_up_Hy_ij, &
+                                       this%mur_up_Hy_ji, this%mur_up_Hy_jj, &
+                                       this%mur_up_Hy_ki, this%mur_up_Hy_kj)
+
+      call gpu_advanceMUR_Hx_up_kernel(this%Hx_d, this%sggMiHx_d, &
+                                       this%mur_past_Hx_up, this%mur_up_CAB1, &
+                                       this%mur_up_Hx_ii, this%mur_up_Hx_ij, &
+                                       this%mur_up_Hx_ji, this%mur_up_Hx_jj, &
+                                       this%mur_up_Hx_ki, this%mur_up_Hx_kj)
+
+   end subroutine gpu_advanceMUR_H_up
+
+   subroutine gpu_advanceMUR_Hy_up_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, &
+                                          ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHy_d(i, j, kj-1)
+            Hy_d(i, j, kj) = past_Hy_d(i, j, kj-1) + &
+                             CAB1_d(medio) * (Hy_d(i, j, kj-1) - past_Hy_d(i, j, kj))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hy_up_kernel
+
+   subroutine gpu_advanceMUR_Hx_up_kernel(Hx_d, sggMiHx_d, past_Hx_d, CAB1_d, &
+                                          ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHx_d(i, j, kj-1)
+            Hx_d(i, j, kj) = past_Hx_d(i, j, kj-1) + &
+                             CAB1_d(medio) * (Hx_d(i, j, kj-1) - past_Hx_d(i, j, kj))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hx_up_kernel
+
+   !--------------------------------------------------------------------------------
+   ! MUR Back boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_back(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hz_back_kernel(this%Hz_d, this%sggMiHz_d, &
+                                         this%mur_past_Hz_back, this%mur_back_CAB1, &
+                                         this%mur_back_Hz_ii, this%mur_back_Hz_ij, &
+                                         this%mur_back_Hz_ji, this%mur_back_Hz_jj, &
+                                         this%mur_back_Hz_ki, this%mur_back_Hz_kj)
+
+      call gpu_advanceMUR_Hy_back_kernel(this%Hy_d, this%sggMiHy_d, &
+                                         this%mur_past_Hy_back, this%mur_back_CAB1, &
+                                         this%mur_back_Hy_ii, this%mur_back_Hy_ij, &
+                                         this%mur_back_Hy_ji, this%mur_back_Hy_jj, &
+                                         this%mur_back_Hy_ki, this%mur_back_Hy_kj)
+
+   end subroutine gpu_advanceMUR_H_back
+
+   subroutine gpu_advanceMUR_Hz_back_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHz_d(i, j, ki-1)
+            Hz_d(i, j, ki) = past_Hz_d(i, j, ki-1) + &
+                             CAB1_d(medio) * (Hz_d(i, j, ki-1) - past_Hz_d(i, j, ki))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hz_back_kernel
+
+   subroutine gpu_advanceMUR_Hy_back_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, &
+                                            ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHy_d(i, j, ki-1)
+            Hy_d(i, j, ki) = past_Hy_d(i, j, ki-1) + &
+                             CAB1_d(medio) * (Hy_d(i, j, ki-1) - past_Hy_d(i, j, ki))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hy_back_kernel
+
+   !--------------------------------------------------------------------------------
+   ! MUR Front boundary - GPU accelerated kernels
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceMUR_H_front(this, b)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: b
+
+      if (.not. this%mur_initialized) return
+
+      call gpu_advanceMUR_Hz_front_kernel(this%Hz_d, this%sggMiHz_d, &
+                                          this%mur_past_Hz_front, this%mur_front_CAB1, &
+                                          this%mur_front_Hz_ii, this%mur_front_Hz_ij, &
+                                          this%mur_front_Hz_ji, this%mur_front_Hz_jj, &
+                                          this%mur_front_Hz_ki, this%mur_front_Hz_kj)
+
+      call gpu_advanceMUR_Hy_front_kernel(this%Hy_d, this%sggMiHy_d, &
+                                          this%mur_past_Hy_front, this%mur_front_CAB1, &
+                                          this%mur_front_Hy_ii, this%mur_front_Hy_ij, &
+                                          this%mur_front_Hy_ji, this%mur_front_Hy_jj, &
+                                          this%mur_front_Hy_ki, this%mur_front_Hy_kj)
+
+   end subroutine gpu_advanceMUR_H_front
+
+   subroutine gpu_advanceMUR_Hz_front_kernel(Hz_d, sggMiHz_d, past_Hz_d, CAB1_d, &
+                                             ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHz_d(i, j, kj+1)
+            Hz_d(i, j, kj) = past_Hz_d(i, j, kj+1) + &
+                             CAB1_d(medio) * (Hz_d(i, j, kj+1) - past_Hz_d(i, j, kj))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hz_front_kernel
+
+   subroutine gpu_advanceMUR_Hy_front_kernel(Hy_d, sggMiHy_d, past_Hy_d, CAB1_d, &
+                                             ii, ij, ji, jj, ki, kj)
+      integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: CAB1_d
+
+      integer(kind=4) :: i, j
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(2) 
+      do j=ji,jj
+         do i=ii,ij
+            medio = sggMiHy_d(i, j, kj+1)
+            Hy_d(i, j, kj) = past_Hy_d(i, j, kj+1) + &
+                             CAB1_d(medio) * (Hy_d(i, j, kj+1) - past_Hy_d(i, j, kj))
+         end do
+      end do
+   end subroutine gpu_advanceMUR_Hy_front_kernel
+
+    !--------------------------------------------------------------------------------
+    ! Update MUR past fields on device - copy current Hx/Hy/Hz to past arrays
+    ! Called after each MUR step to prepare for next timestep
+    !--------------------------------------------------------------------------------
+    subroutine gpu_update_mur_past_left(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hx_left_kernel(this%Hx_d, this%mur_past_Hx_left, &
+                                               this%mur_left_Hx_ii, this%mur_left_Hx_ij, &
+                                               this%mur_left_Hx_ji, this%mur_left_Hx_jj, &
+                                               this%mur_left_Hx_ki, this%mur_left_Hx_kj)
+
+       call gpu_update_mur_past_Hz_left_kernel(this%Hz_d, this%mur_past_Hz_left, &
+                                               this%mur_left_Hz_ii, this%mur_left_Hz_ij, &
+                                               this%mur_left_Hz_ji, this%mur_left_Hz_jj, &
+                                               this%mur_left_Hz_ki, this%mur_left_Hz_kj)
+    end subroutine gpu_update_mur_past_left
+
+    subroutine gpu_update_mur_past_Hx_left_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hx_d(i, ji, k) = Hx_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hx_left_kernel
+
+    subroutine gpu_update_mur_past_Hz_left_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hz_d(i, ji, k) = Hz_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hz_left_kernel
+
+    subroutine gpu_update_mur_past_right(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hx_right_kernel(this%Hx_d, this%mur_past_Hx_right, &
+                                                this%mur_right_Hx_ii, this%mur_right_Hx_ij, &
+                                                this%mur_right_Hx_ji, this%mur_right_Hx_jj, &
+                                                this%mur_right_Hx_ki, this%mur_right_Hx_kj)
+
+       call gpu_update_mur_past_Hz_right_kernel(this%Hz_d, this%mur_past_Hz_right, &
+                                                this%mur_right_Hz_ii, this%mur_right_Hz_ij, &
+                                                this%mur_right_Hz_ji, this%mur_right_Hz_jj, &
+                                                this%mur_right_Hz_ki, this%mur_right_Hz_kj)
+    end subroutine gpu_update_mur_past_right
+
+    subroutine gpu_update_mur_past_Hx_right_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hx_d(i, ji, k) = Hx_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hx_right_kernel
+
+    subroutine gpu_update_mur_past_Hz_right_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hz_d(i, ji, k) = Hz_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hz_right_kernel
+
+    subroutine gpu_update_mur_past_down(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hy_down_kernel(this%Hy_d, this%mur_past_Hy_down, &
+                                               this%mur_down_Hy_ii, this%mur_down_Hy_ij, &
+                                               this%mur_down_Hy_ji, this%mur_down_Hy_jj, &
+                                               this%mur_down_Hy_ki, this%mur_down_Hy_kj)
+
+       call gpu_update_mur_past_Hx_down_kernel(this%Hx_d, this%mur_past_Hx_down, &
+                                               this%mur_down_Hx_ii, this%mur_down_Hx_ij, &
+                                               this%mur_down_Hx_ji, this%mur_down_Hx_jj, &
+                                               this%mur_down_Hx_ki, this%mur_down_Hx_kj)
+    end subroutine gpu_update_mur_past_down
+
+    subroutine gpu_update_mur_past_Hy_down_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hy_d(i, ji, k) = Hy_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hy_down_kernel
+
+    subroutine gpu_update_mur_past_Hx_down_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hx_d(i, ji, k) = Hx_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hx_down_kernel
+
+    subroutine gpu_update_mur_past_up(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hy_up_kernel(this%Hy_d, this%mur_past_Hy_up, &
+                                             this%mur_up_Hy_ii, this%mur_up_Hy_ij, &
+                                             this%mur_up_Hy_ji, this%mur_up_Hy_jj, &
+                                             this%mur_up_Hy_ki, this%mur_up_Hy_kj)
+
+       call gpu_update_mur_past_Hx_up_kernel(this%Hx_d, this%mur_past_Hx_up, &
+                                             this%mur_up_Hx_ii, this%mur_up_Hx_ij, &
+                                             this%mur_up_Hx_ji, this%mur_up_Hx_jj, &
+                                             this%mur_up_Hx_ki, this%mur_up_Hx_kj)
+    end subroutine gpu_update_mur_past_up
+
+    subroutine gpu_update_mur_past_Hy_up_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hy_d(i, ji, k) = Hy_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hy_up_kernel
+
+    subroutine gpu_update_mur_past_Hx_up_kernel(Hx_d, past_Hx_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d, past_Hx_d
+
+       integer(kind=4) :: i, k
+
+       !$cuf kernel do(2) 
+       do k=ki,kj
+          do i=ii,ij
+             past_Hx_d(i, ji, k) = Hx_d(i, ji, k)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hx_up_kernel
+
+    subroutine gpu_update_mur_past_back(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hz_back_kernel(this%Hz_d, this%mur_past_Hz_back, &
+                                               this%mur_back_Hz_ii, this%mur_back_Hz_ij, &
+                                               this%mur_back_Hz_ji, this%mur_back_Hz_jj, &
+                                               this%mur_back_Hz_ki, this%mur_back_Hz_kj)
+
+       call gpu_update_mur_past_Hy_back_kernel(this%Hy_d, this%mur_past_Hy_back, &
+                                               this%mur_back_Hy_ii, this%mur_back_Hy_ij, &
+                                               this%mur_back_Hy_ji, this%mur_back_Hy_jj, &
+                                               this%mur_back_Hy_ki, this%mur_back_Hy_kj)
+    end subroutine gpu_update_mur_past_back
+
+    subroutine gpu_update_mur_past_Hz_back_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+
+       integer(kind=4) :: i, j
+
+       !$cuf kernel do(2) 
+       do j=ji,jj
+          do i=ii,ij
+             past_Hz_d(i, j, kj) = Hz_d(i, j, kj)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hz_back_kernel
+
+    subroutine gpu_update_mur_past_Hy_back_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+
+       integer(kind=4) :: i, j
+
+       !$cuf kernel do(2) 
+       do j=ji,jj
+          do i=ii,ij
+             past_Hy_d(i, j, kj) = Hy_d(i, j, kj)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hy_back_kernel
+
+    subroutine gpu_update_mur_past_front(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call gpu_update_mur_past_Hz_front_kernel(this%Hz_d, this%mur_past_Hz_front, &
+                                                this%mur_front_Hz_ii, this%mur_front_Hz_ij, &
+                                                this%mur_front_Hz_ji, this%mur_front_Hz_jj, &
+                                                this%mur_front_Hz_ki, this%mur_front_Hz_kj)
+
+       call gpu_update_mur_past_Hy_front_kernel(this%Hy_d, this%mur_past_Hy_front, &
+                                                this%mur_front_Hy_ii, this%mur_front_Hy_ij, &
+                                                this%mur_front_Hy_ji, this%mur_front_Hy_jj, &
+                                                this%mur_front_Hy_ki, this%mur_front_Hy_kj)
+    end subroutine gpu_update_mur_past_front
+
+    subroutine gpu_update_mur_past_Hz_front_kernel(Hz_d, past_Hz_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d, past_Hz_d
+
+       integer(kind=4) :: i, j
+
+       !$cuf kernel do(2) 
+       do j=ji,jj
+          do i=ii,ij
+             past_Hz_d(i, j, kj) = Hz_d(i, j, kj)
+          end do
+       end do
+    end subroutine gpu_update_mur_past_Hz_front_kernel
+
+    subroutine gpu_update_mur_past_Hy_front_kernel(Hy_d, past_Hy_d, ii, ij, ji, jj, ki, kj)
+       integer(kind=4), intent(in) :: ii, ij, ji, jj, ki, kj
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d, past_Hy_d
+
+       integer(kind=4) :: i, j
+
+       !$cuf kernel do(2) 
+       do j=ji,jj
+          do i=ii,ij
+             past_Hy_d(i, j, kj) = Hy_d(i, j, kj)
+          end do
+       end do
+   end subroutine gpu_update_mur_past_Hy_front_kernel
+
+    !--------------------------------------------------------------------------------
+    ! FUSED MUR KERNELS — 3 kernels instead of 24 (advance + update_past per field)
+    ! Reduces kernel launches from 300K to ~30K (3 per timestep)
+    !--------------------------------------------------------------------------------
+
+    ! Fused advance for Hx: handles left, right, down, up boundaries
+    subroutine gpu_fused_mur_advance_hx(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_advance_hx_kernel( &
+          this%Hx_d, this%sggMiHx_d, &
+          this%mur_past_Hx_left, this%mur_past_Hx_right, &
+          this%mur_past_Hx_down, this%mur_past_Hx_up, &
+          this%mur_left_Hx_ii, this%mur_left_Hx_ij, this%mur_left_Hx_ji, this%mur_left_Hx_jj, this%mur_left_Hx_ki, this%mur_left_Hx_kj, &
+          this%mur_right_Hx_ii, this%mur_right_Hx_ij, this%mur_right_Hx_ji, this%mur_right_Hx_jj, this%mur_right_Hx_ki, this%mur_right_Hx_kj, &
+          this%mur_down_Hx_ii, this%mur_down_Hx_ij, this%mur_down_Hx_ji, this%mur_down_Hx_jj, this%mur_down_Hx_ki, this%mur_down_Hx_kj, &
+          this%mur_up_Hx_ii, this%mur_up_Hx_ij, this%mur_up_Hx_ji, this%mur_up_Hx_jj, this%mur_up_Hx_ki, this%mur_up_Hx_kj, &
+          this%mur_left_CAB1, this%mur_right_CAB1, this%mur_down_CAB1, this%mur_up_CAB1)
+
+    end subroutine gpu_fused_mur_advance_hx
+
+    ! Fused advance for Hy: handles down, up, back, front boundaries
+    subroutine gpu_fused_mur_advance_hy(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_advance_hy_kernel( &
+          this%Hy_d, this%sggMiHy_d, &
+          this%mur_past_Hy_down, this%mur_past_Hy_up, &
+          this%mur_past_Hy_back, this%mur_past_Hy_front, &
+          this%mur_down_Hy_ii, this%mur_down_Hy_ij, this%mur_down_Hy_ji, this%mur_down_Hy_jj, this%mur_down_Hy_ki, this%mur_down_Hy_kj, &
+          this%mur_up_Hy_ii, this%mur_up_Hy_ij, this%mur_up_Hy_ji, this%mur_up_Hy_jj, this%mur_up_Hy_ki, this%mur_up_Hy_kj, &
+          this%mur_back_Hy_ii, this%mur_back_Hy_ij, this%mur_back_Hy_ji, this%mur_back_Hy_jj, this%mur_back_Hy_ki, this%mur_back_Hy_kj, &
+          this%mur_front_Hy_ii, this%mur_front_Hy_ij, this%mur_front_Hy_ji, this%mur_front_Hy_jj, this%mur_front_Hy_ki, this%mur_front_Hy_kj, &
+          this%mur_down_CAB1, this%mur_up_CAB1, this%mur_back_CAB1, this%mur_front_CAB1)
+
+    end subroutine gpu_fused_mur_advance_hy
+
+    ! Fused advance for Hz: handles left, right, back, front boundaries
+    subroutine gpu_fused_mur_advance_hz(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_advance_hz_kernel( &
+          this%Hz_d, this%sggMiHz_d, &
+          this%mur_past_Hz_left, this%mur_past_Hz_right, &
+          this%mur_past_Hz_back, this%mur_past_Hz_front, &
+          this%mur_left_Hz_ii, this%mur_left_Hz_ij, this%mur_left_Hz_ji, this%mur_left_Hz_jj, this%mur_left_Hz_ki, this%mur_left_Hz_kj, &
+          this%mur_right_Hz_ii, this%mur_right_Hz_ij, this%mur_right_Hz_ji, this%mur_right_Hz_jj, this%mur_right_Hz_ki, this%mur_right_Hz_kj, &
+          this%mur_back_Hz_ii, this%mur_back_Hz_ij, this%mur_back_Hz_ji, this%mur_back_Hz_jj, this%mur_back_Hz_ki, this%mur_back_Hz_kj, &
+          this%mur_front_Hz_ii, this%mur_front_Hz_ij, this%mur_front_Hz_ji, this%mur_front_Hz_jj, this%mur_front_Hz_ki, this%mur_front_Hz_kj, &
+          this%mur_left_CAB1, this%mur_right_CAB1, this%mur_back_CAB1, this%mur_front_CAB1)
+
+    end subroutine gpu_fused_mur_advance_hz
+
+    ! Fused update_past for Hx: left, right, down, up
+    subroutine gpu_fused_mur_update_past_hx(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_update_past_hx_kernel( &
+          this%Hx_d, &
+          this%mur_past_Hx_left, this%mur_past_Hx_right, &
+          this%mur_past_Hx_down, this%mur_past_Hx_up, &
+          this%mur_left_Hx_ii, this%mur_left_Hx_ij, this%mur_left_Hx_ji, this%mur_left_Hx_jj, this%mur_left_Hx_ki, this%mur_left_Hx_kj, &
+          this%mur_right_Hx_ii, this%mur_right_Hx_ij, this%mur_right_Hx_ji, this%mur_right_Hx_jj, this%mur_right_Hx_ki, this%mur_right_Hx_kj, &
+          this%mur_down_Hx_ii, this%mur_down_Hx_ij, this%mur_down_Hx_ji, this%mur_down_Hx_jj, this%mur_down_Hx_ki, this%mur_down_Hx_kj, &
+          this%mur_up_Hx_ii, this%mur_up_Hx_ij, this%mur_up_Hx_ji, this%mur_up_Hx_jj, this%mur_up_Hx_ki, this%mur_up_Hx_kj)
+
+    end subroutine gpu_fused_mur_update_past_hx
+
+    ! Fused update_past for Hy: down, up, back, front
+    subroutine gpu_fused_mur_update_past_hy(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_update_past_hy_kernel( &
+          this%Hy_d, &
+          this%mur_past_Hy_down, this%mur_past_Hy_up, &
+          this%mur_past_Hy_back, this%mur_past_Hy_front, &
+          this%mur_down_Hy_ii, this%mur_down_Hy_ij, this%mur_down_Hy_ji, this%mur_down_Hy_jj, this%mur_down_Hy_ki, this%mur_down_Hy_kj, &
+          this%mur_up_Hy_ii, this%mur_up_Hy_ij, this%mur_up_Hy_ji, this%mur_up_Hy_jj, this%mur_up_Hy_ki, this%mur_up_Hy_kj, &
+          this%mur_back_Hy_ii, this%mur_back_Hy_ij, this%mur_back_Hy_ji, this%mur_back_Hy_jj, this%mur_back_Hy_ki, this%mur_back_Hy_kj, &
+          this%mur_front_Hy_ii, this%mur_front_Hy_ij, this%mur_front_Hy_ji, this%mur_front_Hy_jj, this%mur_front_Hy_ki, this%mur_front_Hy_kj)
+
+    end subroutine gpu_fused_mur_update_past_hy
+
+    ! Fused update_past for Hz: left, right, back, front
+    subroutine gpu_fused_mur_update_past_hz(this, b)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: b
+
+       if (.not. this%mur_initialized) return
+
+       call fused_mur_update_past_hz_kernel( &
+          this%Hz_d, &
+          this%mur_past_Hz_left, this%mur_past_Hz_right, &
+          this%mur_past_Hz_back, this%mur_past_Hz_front, &
+          this%mur_left_Hz_ii, this%mur_left_Hz_ij, this%mur_left_Hz_ji, this%mur_left_Hz_jj, this%mur_left_Hz_ki, this%mur_left_Hz_kj, &
+          this%mur_right_Hz_ii, this%mur_right_Hz_ij, this%mur_right_Hz_ji, this%mur_right_Hz_jj, this%mur_right_Hz_ki, this%mur_right_Hz_kj, &
+          this%mur_back_Hz_ii, this%mur_back_Hz_ij, this%mur_back_Hz_ji, this%mur_back_Hz_jj, this%mur_back_Hz_ki, this%mur_back_Hz_kj, &
+          this%mur_front_Hz_ii, this%mur_front_Hz_ij, this%mur_front_Hz_ji, this%mur_front_Hz_jj, this%mur_front_Hz_ki, this%mur_front_Hz_kj)
+
+    end subroutine gpu_fused_mur_update_past_hz
+
+    !===============================================================================
+    ! Fused advance Hx kernel — 4 boundaries: left, right, down, up
+    !===============================================================================
+    subroutine fused_mur_advance_hx_kernel(Hx_d, sggMiHx_d, &
+                                           past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up, &
+                                           l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, &
+                                           r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, &
+                                           d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, &
+                                           u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, &
+                                           left_CAB1, right_CAB1, down_CAB1, up_CAB1)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d
+       integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up
+       real(kind=rkind), device, dimension(:) :: left_CAB1, right_CAB1, down_CAB1, up_CAB1
+       integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj
+       integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj
+       integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj
+       integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj
+
+       integer(kind=4) :: i, j, k
+       integer(kind=integersizeofmediamatrices) :: medio
+
+       !$cuf kernel do(2) 
+       do k=l_ki,l_kj
+          do i=l_ii,l_ij
+             medio = sggMiHx_d(i, l_ji+1, k)
+             Hx_d(i, l_ji, k) = past_Hx_left(i, l_ji+1, k) + &
+                  left_CAB1(medio) * (Hx_d(i, l_ji+1, k) - past_Hx_left(i, l_ji, k))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=r_ki,r_kj
+          do i=r_ii,r_ij
+             medio = sggMiHx_d(i, r_ji-1, k)
+             Hx_d(i, r_ji, k) = past_Hx_right(i, r_ji-1, k) + &
+                  right_CAB1(medio) * (Hx_d(i, r_ji-1, k) - past_Hx_right(i, r_ji, k))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=d_ji,d_jj
+          do i=d_ii,d_ij
+             medio = sggMiHx_d(i, j, d_ki+1)
+             Hx_d(i, j, d_ki) = past_Hx_down(i, j, d_ki+1) + &
+                  down_CAB1(medio) * (Hx_d(i, j, d_ki+1) - past_Hx_down(i, j, d_ki))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=u_ji,u_jj
+          do i=u_ii,u_ij
+             medio = sggMiHx_d(i, j, u_kj-1)
+             Hx_d(i, j, u_kj) = past_Hx_up(i, j, u_kj-1) + &
+                  up_CAB1(medio) * (Hx_d(i, j, u_kj-1) - past_Hx_up(i, j, u_kj))
+          end do
+       end do
+
+    end subroutine fused_mur_advance_hx_kernel
+
+    !===============================================================================
+    ! Fused advance Hy kernel — 4 boundaries: down, up, back, front
+    !===============================================================================
+    subroutine fused_mur_advance_hy_kernel(Hy_d, sggMiHy_d, &
+                                           past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front, &
+                                           d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, &
+                                           u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, &
+                                           b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, &
+                                           f_ii, f_ij, f_ji, f_jj, f_ki, f_kj, &
+                                           down_CAB1, up_CAB1, back_CAB1, front_CAB1)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d
+       integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front
+       real(kind=rkind), device, dimension(:) :: down_CAB1, up_CAB1, back_CAB1, front_CAB1
+       integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj
+       integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj
+       integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj
+       integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj
+
+       integer(kind=4) :: i, j, k
+       integer(kind=integersizeofmediamatrices) :: medio
+
+       !$cuf kernel do(2) 
+       do j=d_ji,d_jj
+          do i=d_ii,d_ij
+             medio = sggMiHy_d(i, j, d_ki+1)
+             Hy_d(i, j, d_ki) = past_Hy_down(i, j, d_ki+1) + &
+                  down_CAB1(medio) * (Hy_d(i, j, d_ki+1) - past_Hy_down(i, j, d_ki))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=u_ji,u_jj
+          do i=u_ii,u_ij
+             medio = sggMiHy_d(i, j, u_kj-1)
+             Hy_d(i, j, u_kj) = past_Hy_up(i, j, u_kj-1) + &
+                  up_CAB1(medio) * (Hy_d(i, j, u_kj-1) - past_Hy_up(i, j, u_kj))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=b_ji,b_jj
+          do i=b_ii,b_ij
+             medio = sggMiHy_d(i, j, b_ki-1)
+             Hy_d(i, j, b_ki) = past_Hy_back(i, j, b_ki-1) + &
+                  back_CAB1(medio) * (Hy_d(i, j, b_ki-1) - past_Hy_back(i, j, b_ki))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=f_ji,f_jj
+          do i=f_ii,f_ij
+             medio = sggMiHy_d(i, j, f_kj+1)
+             Hy_d(i, j, f_kj) = past_Hy_front(i, j, f_kj+1) + &
+                  front_CAB1(medio) * (Hy_d(i, j, f_kj+1) - past_Hy_front(i, j, f_kj))
+          end do
+       end do
+
+    end subroutine fused_mur_advance_hy_kernel
+
+    !===============================================================================
+    ! Fused advance Hz kernel — 4 boundaries: left, right, back, front
+    !===============================================================================
+    subroutine fused_mur_advance_hz_kernel(Hz_d, sggMiHz_d, &
+                                           past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front, &
+                                           l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, &
+                                           r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, &
+                                           b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, &
+                                           f_ii, f_ij, f_ji, f_jj, f_ki, f_kj, &
+                                           left_CAB1, right_CAB1, back_CAB1, front_CAB1)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d
+       integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front
+       real(kind=rkind), device, dimension(:) :: left_CAB1, right_CAB1, back_CAB1, front_CAB1
+       integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj
+       integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj
+       integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj
+       integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj
+
+       integer(kind=4) :: i, j, k
+       integer(kind=integersizeofmediamatrices) :: medio
+
+       !$cuf kernel do(2) 
+       do k=l_ki,l_kj
+          do i=l_ii,l_ij
+             medio = sggMiHz_d(i, l_ji+1, k)
+             Hz_d(i, l_ji, k) = past_Hz_left(i, l_ji+1, k) + &
+                  left_CAB1(medio) * (Hz_d(i, l_ji+1, k) - past_Hz_left(i, l_ji, k))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=r_ki,r_kj
+          do i=r_ii,r_ij
+             medio = sggMiHz_d(i, r_ji-1, k)
+             Hz_d(i, r_ji, k) = past_Hz_right(i, r_ji-1, k) + &
+                  right_CAB1(medio) * (Hz_d(i, r_ji-1, k) - past_Hz_right(i, r_ji, k))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=b_ji,b_jj
+          do i=b_ii,b_ij
+             medio = sggMiHz_d(i, j, b_ki-1)
+             Hz_d(i, j, b_ki) = past_Hz_back(i, j, b_ki-1) + &
+                  back_CAB1(medio) * (Hz_d(i, j, b_ki-1) - past_Hz_back(i, j, b_ki))
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=f_ji,f_jj
+          do i=f_ii,f_ij
+             medio = sggMiHz_d(i, j, f_kj+1)
+             Hz_d(i, j, f_kj) = past_Hz_front(i, j, f_kj+1) + &
+                  front_CAB1(medio) * (Hz_d(i, j, f_kj+1) - past_Hz_front(i, j, f_kj))
+          end do
+       end do
+
+    end subroutine fused_mur_advance_hz_kernel
+
+    !===============================================================================
+    ! Fused update_past Hx kernel — 4 boundaries: left, right, down, up
+    !===============================================================================
+    subroutine fused_mur_update_past_hx_kernel(Hx_d, &
+                                               past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up, &
+                                               l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, &
+                                               r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, &
+                                               d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, &
+                                               u_ii, u_ij, u_ji, u_jj, u_ki, u_kj)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hx_left, past_Hx_right, past_Hx_down, past_Hx_up
+       integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj
+       integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj
+       integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj
+       integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj
+
+       integer(kind=4) :: i, j, k
+
+       !$cuf kernel do(2) 
+       do k=l_ki,l_kj
+          do i=l_ii,l_ij
+             past_Hx_left(i, l_ji, k) = Hx_d(i, l_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=r_ki,r_kj
+          do i=r_ii,r_ij
+             past_Hx_right(i, r_ji, k) = Hx_d(i, r_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=d_ki,d_kj
+          do i=d_ii,d_ij
+             past_Hx_down(i, d_ji, k) = Hx_d(i, d_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=u_ki,u_kj
+          do i=u_ii,u_ij
+             past_Hx_up(i, u_ji, k) = Hx_d(i, u_ji, k)
+          end do
+       end do
+
+    end subroutine fused_mur_update_past_hx_kernel
+
+    !===============================================================================
+    ! Fused update_past Hy kernel — 4 boundaries: down, up, back, front
+    !===============================================================================
+    subroutine fused_mur_update_past_hy_kernel(Hy_d, &
+                                               past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front, &
+                                               d_ii, d_ij, d_ji, d_jj, d_ki, d_kj, &
+                                               u_ii, u_ij, u_ji, u_jj, u_ki, u_kj, &
+                                               b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, &
+                                               f_ii, f_ij, f_ji, f_jj, f_ki, f_kj)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hy_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hy_down, past_Hy_up, past_Hy_back, past_Hy_front
+       integer(kind=4), intent(in) :: d_ii, d_ij, d_ji, d_jj, d_ki, d_kj
+       integer(kind=4), intent(in) :: u_ii, u_ij, u_ji, u_jj, u_ki, u_kj
+       integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj
+       integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj
+
+       integer(kind=4) :: i, j, k
+
+       !$cuf kernel do(2) 
+       do k=d_ki,d_kj
+          do i=d_ii,d_ij
+             past_Hy_down(i, d_ji, k) = Hy_d(i, d_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=u_ki,u_kj
+          do i=u_ii,u_ij
+             past_Hy_up(i, u_ji, k) = Hy_d(i, u_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=b_ji,b_jj
+          do i=b_ii,b_ij
+             past_Hy_back(i, j, b_kj) = Hy_d(i, j, b_kj)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=f_ji,f_jj
+          do i=f_ii,f_ij
+             past_Hy_front(i, j, f_kj) = Hy_d(i, j, f_kj)
+          end do
+       end do
+
+    end subroutine fused_mur_update_past_hy_kernel
+
+    !===============================================================================
+    ! Fused update_past Hz kernel — 4 boundaries: left, right, back, front
+    !===============================================================================
+    subroutine fused_mur_update_past_hz_kernel(Hz_d, &
+                                               past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front, &
+                                               l_ii, l_ij, l_ji, l_jj, l_ki, l_kj, &
+                                               r_ii, r_ij, r_ji, r_jj, r_ki, r_kj, &
+                                               b_ii, b_ij, b_ji, b_jj, b_ki, b_kj, &
+                                               f_ii, f_ij, f_ji, f_jj, f_ki, f_kj)
+
+       real(kind=rkind), device, dimension(:,:,:) :: Hz_d
+       real(kind=rkind), device, dimension(:,:,:) :: past_Hz_left, past_Hz_right, past_Hz_back, past_Hz_front
+       integer(kind=4), intent(in) :: l_ii, l_ij, l_ji, l_jj, l_ki, l_kj
+       integer(kind=4), intent(in) :: r_ii, r_ij, r_ji, r_jj, r_ki, r_kj
+       integer(kind=4), intent(in) :: b_ii, b_ij, b_ji, b_jj, b_ki, b_kj
+       integer(kind=4), intent(in) :: f_ii, f_ij, f_ji, f_jj, f_ki, f_kj
+
+       integer(kind=4) :: i, j, k
+
+       !$cuf kernel do(2) 
+       do k=l_ki,l_kj
+          do i=l_ii,l_ij
+             past_Hz_left(i, l_ji, k) = Hz_d(i, l_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do k=r_ki,r_kj
+          do i=r_ii,r_ij
+             past_Hz_right(i, r_ji, k) = Hz_d(i, r_ji, k)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=b_ji,b_jj
+          do i=b_ii,b_ij
+             past_Hz_back(i, j, b_kj) = Hz_d(i, j, b_kj)
+          end do
+       end do
+
+       !$cuf kernel do(2) 
+       do j=f_ji,f_jj
+          do i=f_ii,f_ij
+             past_Hz_front(i, j, f_kj) = Hz_d(i, j, f_kj)
+          end do
+       end do
+
+    end subroutine fused_mur_update_past_hz_kernel
+
+  end module gpu_mur_m
diff --git a/src_main_pub/gpu_nf2ff_m.F90 b/src_main_pub/gpu_nf2ff_m.F90
new file mode 100644
index 000000000..65ea86a74
--- /dev/null
+++ b/src_main_pub/gpu_nf2ff_m.F90
@@ -0,0 +1,66 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU NF2FF MODULE - CUDA Fortran (CUF) accelerated near-to-far-field
+!  Infrastructure for GPU-accelerated far-field pattern computation.
+!  NOTE: GPU kernel not yet implemented — falls through to CPU path.
+!  Implementation pending: full face-specific indexing from farfield.F90.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_nf2ff_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Device function: compute average of two complex values (geometric or arithmetic)
+   !--------------------------------------------------------------------------------
+   attributes(device) function AverageNF2FF(pasadas, z1, z2) result(z)
+      complex(kind=rkind), intent(in) :: z1, z2
+      integer, intent(in) :: pasadas
+      complex(kind=rkind) :: z
+      real(kind=rkind) :: phi1, phi2, nphi1, nphi2
+
+      z = (0.0_rkind, 0.0_rkind)
+      if (pasadas == 2) then
+         phi1 = atan2(imag(z1), real(z1))
+         phi2 = atan2(imag(z2), real(z2))
+         nphi1 = phi1
+         nphi2 = phi2
+         if ((phi1 < -pi/2.0_rkind) .and. (phi2 > pi/2.0_rkind)) nphi1 = phi1 - 2.0_rkind * pi
+         if ((phi2 < -pi/2.0_rkind) .and. (phi1 > pi/2.0_rkind)) nphi2 = phi2 - 2.0_rkind * pi
+         z = sqrt(abs(z1*z2)) * exp((0.0_rkind, 1.0_rkind) * (nphi1 + nphi2) / 2.0_rkind)
+      elseif (pasadas == 1) then
+         z = (z1 + z2) / 2.0_rkind
+      endif
+   end function AverageNF2FF
+
+   !--------------------------------------------------------------------------------
+   ! Public interface: GPU far-field flush
+   ! NOTE: GPU kernel not yet implemented. Falls through to CPU.
+   ! The gpu_state_t%nf2ff_initialized flag is set during init,
+   ! but the actual GPU computation is deferred until the full
+   ! face-specific indexing from farfield.F90 is implemented.
+   !--------------------------------------------------------------------------------
+   subroutine gpu_flush_nf2ff(this, Etheta_h, Ephi_h, RCS_h)
+      class(gpu_state_t), intent(inout) :: this
+      real(kind=rkind), dimension(:), intent(out) :: Etheta_h, Ephi_h, RCS_h
+
+      ! GPU kernel not yet implemented — CPU path handles this.
+      ! The nf2ff_initialized flag is set during init, but the actual
+      ! GPU flush is a no-op until full implementation.
+      ! TODO: Implement gpu_flush_nf2ff_kernel when NVHPC argument
+      ! limit is addressed (use derived type or split into multiple kernels).
+
+      ! Zero output buffers (CPU fallback will overwrite)
+      Etheta_h = 0.0_rkind
+      Ephi_h = 0.0_rkind
+      RCS_h = 0.0_rkind
+
+   end subroutine gpu_flush_nf2ff
+
+end module gpu_nf2ff_m
diff --git a/src_main_pub/gpu_sgbc_core_m.F90 b/src_main_pub/gpu_sgbc_core_m.F90
new file mode 100644
index 000000000..faa288270
--- /dev/null
+++ b/src_main_pub/gpu_sgbc_core_m.F90
@@ -0,0 +1,461 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU SGBC CORE MODULE - CUDA Fortran (CUF) accelerated SGBC
+!  Flattened arrays for all SGBC node data (no pointer indirection)
+!  Handles init/upload/download/destroy for non-dispersive SGBC only.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_sgbc_core_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+
+   implicit none
+
+   integer, parameter :: SGBC_MAX_DEPTH = 16
+
+   type gpu_state_sgbc_t
+      integer(kind=4) :: numNodes, maxDepth
+
+      ! Per-node constants (contiguous 1D arrays, indexed by node)
+      integer(kind=4), device, allocatable, dimension(:) :: d_node
+      real(kind=rkind), device, allocatable, dimension(:) :: gm2_ext
+      integer(kind=4), device, allocatable, dimension(:) :: jmed_node
+      real(kind=rkind), device, allocatable, dimension(:) :: transE, transH, alignedH
+      real(kind=rkind), device, allocatable, dimension(:) :: g1_val, g2a_val, g2b_val
+      real(kind=rkind), device, allocatable, dimension(:) :: gm2_externo
+      logical, device, allocatable, dimension(:) :: correct_ha, correct_hb, crank, filo_placa
+      integer(kind=4), device, allocatable, dimension(:) :: depth_node
+
+      ! Per-node state (contiguous 2D: [2*maxDepth+1, numNodes])
+      real(kind=rkind), device, allocatable, dimension(:,:) :: E_state, H_state, E_past_state
+      real(kind=rkind), device, allocatable, dimension(:,:) :: D_state
+      real(kind=rkind), device, allocatable, dimension(:,:) :: G1_int, G2_int, GM1_int, GM2_int
+      real(kind=rkind), device, allocatable, dimension(:,:) :: a_coef, b_coef, c_coef
+      real(kind=rkind), device, allocatable, dimension(:,:) :: rb_coef, rh_coef, rhm1_coef
+      integer(kind=4), device, allocatable, dimension(:,:) :: capa_state
+      real(kind=rkind), device, allocatable, dimension(:,:) :: delta_state
+
+      ! Per-node tridiag boundary constants
+      real(kind=rkind), device, allocatable, dimension(:) :: tridiag_a1, tridiag_b1, tridiag_c1
+      real(kind=rkind), device, allocatable, dimension(:) :: tridiag_an, tridiag_bn, tridiag_cn
+
+      ! Per-timestep field values (device, synced each step)
+      real(kind=rkind), device, allocatable, dimension(:) :: Efield_val, Ha_Plus_val, Ha_Minu_val
+      real(kind=rkind), device, allocatable, dimension(:) :: Hb_Plus_val, Hb_Minu_val
+      real(kind=rkind), device, allocatable, dimension(:) :: Hyee_left, Hyee_right
+
+      ! Host-side references for field sync
+      real(kind=rkind), pointer, dimension(:,:,:) :: Ex, Ey, Ez, Hx, Hy, Hz
+      integer(kind=4), pointer, dimension(:,:,:) :: Idxe, Idye, Idze, Idxh, Idyh, Idzh
+
+      logical :: initialized = .false.
+      logical :: fields_on_device = .false.
+   end type
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Initialize GPU SGBC state - called once at SGBC init
+   !--------------------------------------------------------------------------------
+   subroutine gpu_init_sgbc(this, numNodes, maxDepth, dt, sgbcFreq, SGBCcrank, SGBCDispersive)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      integer(kind=4), intent(in) :: numNodes, maxDepth
+      real(kind=rkind), intent(in) :: dt, sgbcFreq
+      logical, intent(in) :: SGBCcrank, SGBCDispersive
+
+      integer(kind=4) :: ndev, cuda_status, env_status
+      character(len=16) :: enable_cuf
+      integer(kind=4) :: i
+
+      call get_environment_variable("SEMBA_FDTD_ENABLE_CUF_RUNTIME", enable_cuf, status=env_status)
+      if (env_status /= 0 .or. trim(enable_cuf) /= "1") then
+         this%initialized = .false.
+         return
+      endif
+
+      cuda_status = cudaGetDeviceCount(ndev)
+      if (cuda_status /= cudaSuccess .or. ndev <= 0) then
+         this%initialized = .false.
+         return
+      endif
+
+      this%numNodes = numNodes
+      this%maxDepth = maxDepth
+
+      ! Allocate per-node constant arrays
+      allocate(this%d_node(numNodes))
+      allocate(this%gm2_ext(numNodes))
+      allocate(this%jmed_node(numNodes))
+      allocate(this%transE(numNodes))
+      allocate(this%transH(numNodes))
+      allocate(this%alignedH(numNodes))
+      allocate(this%g1_val(numNodes))
+      allocate(this%g2a_val(numNodes))
+      allocate(this%g2b_val(numNodes))
+      allocate(this%gm2_externo(numNodes))
+      allocate(this%correct_ha(numNodes))
+      allocate(this%correct_hb(numNodes))
+      allocate(this%crank(numNodes))
+      allocate(this%filo_placa(numNodes))
+      allocate(this%depth_node(numNodes))
+
+      ! Allocate per-node state arrays [2*maxDepth+1, numNodes]
+      allocate(this%E_state(2*maxDepth+1, numNodes))
+      allocate(this%H_state(2*maxDepth+1, numNodes))
+      allocate(this%E_past_state(2*maxDepth+1, numNodes))
+      allocate(this%D_state(2*maxDepth+1, numNodes))
+      allocate(this%G1_int(2*maxDepth+1, numNodes))
+      allocate(this%G2_int(2*maxDepth+1, numNodes))
+      allocate(this%GM1_int(2*maxDepth+1, numNodes))
+      allocate(this%GM2_int(2*maxDepth+1, numNodes))
+      allocate(this%a_coef(2*maxDepth+1, numNodes))
+      allocate(this%b_coef(2*maxDepth+1, numNodes))
+      allocate(this%c_coef(2*maxDepth+1, numNodes))
+      allocate(this%rb_coef(2*maxDepth+1, numNodes))
+      allocate(this%rh_coef(2*maxDepth+1, numNodes))
+      allocate(this%rhm1_coef(2*maxDepth+1, numNodes))
+      allocate(this%capa_state(2*maxDepth+1, numNodes))
+      allocate(this%delta_state(2*maxDepth+1, numNodes))
+
+      ! Allocate tridiag boundary constants
+      allocate(this%tridiag_a1(numNodes))
+      allocate(this%tridiag_b1(numNodes))
+      allocate(this%tridiag_c1(numNodes))
+      allocate(this%tridiag_an(numNodes))
+      allocate(this%tridiag_bn(numNodes))
+      allocate(this%tridiag_cn(numNodes))
+
+      ! Allocate per-timestep field arrays
+      allocate(this%Efield_val(numNodes))
+      allocate(this%Ha_Plus_val(numNodes))
+      allocate(this%Ha_Minu_val(numNodes))
+      allocate(this%Hb_Plus_val(numNodes))
+      allocate(this%Hb_Minu_val(numNodes))
+      allocate(this%Hyee_left(numNodes))
+      allocate(this%Hyee_right(numNodes))
+
+      ! Initialize to zero
+      this%E_state = 0.0_rkind
+      this%H_state = 0.0_rkind
+      this%E_past_state = 0.0_rkind
+      this%D_state = 0.0_rkind
+      this%G1_int = 0.0_rkind
+      this%G2_int = 0.0_rkind
+      this%GM1_int = 0.0_rkind
+      this%GM2_int = 0.0_rkind
+      this%a_coef = 0.0_rkind
+      this%b_coef = 0.0_rkind
+      this%c_coef = 0.0_rkind
+      this%rb_coef = 0.0_rkind
+      this%rh_coef = 0.0_rkind
+      this%rhm1_coef = 0.0_rkind
+      this%capa_state = 0
+      this%delta_state = 0.0_rkind
+
+      this%initialized = .true.
+      this%fields_on_device = .false.
+
+   end subroutine gpu_init_sgbc
+
+   !--------------------------------------------------------------------------------
+   ! Upload per-node constants from host to device
+   ! Called once after InitSGBCs completes
+   !--------------------------------------------------------------------------------
+   subroutine gpu_upload_sgbc_constants(this, depth_arr, gm2_ext_arr, jmed_arr, &
+                                        transE_arr, transH_arr, alignedH_arr, &
+                                        g1_arr, g2a_arr, g2b_arr, gm2_ext_val_arr, &
+                                        correct_ha_arr, correct_hb_arr, crank_arr, &
+                                        filo_placa_arr, depth_node_arr)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      integer(kind=4), intent(in) :: depth_arr(:), jmed_arr(:), depth_node_arr(:)
+      real(kind=rkind), intent(in) :: gm2_ext_arr(:), transE_arr(:), transH_arr(:), alignedH_arr(:)
+      real(kind=rkind), intent(in) :: g1_arr(:), g2a_arr(:), g2b_arr(:), gm2_ext_val_arr(:)
+      logical, intent(in) :: correct_ha_arr(:), correct_hb_arr(:), crank_arr(:), filo_placa_arr(:)
+
+      integer(kind=4) :: n
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      this%d_node(1:n) = depth_arr(1:n)
+      this%gm2_ext(1:n) = gm2_ext_arr(1:n)
+      this%jmed_node(1:n) = jmed_arr(1:n)
+      this%transE(1:n) = transE_arr(1:n)
+      this%transH(1:n) = transH_arr(1:n)
+      this%alignedH(1:n) = alignedH_arr(1:n)
+      this%g1_val(1:n) = g1_arr(1:n)
+      this%g2a_val(1:n) = g2a_arr(1:n)
+      this%g2b_val(1:n) = g2b_arr(1:n)
+      this%gm2_externo(1:n) = gm2_ext_val_arr(1:n)
+      this%correct_ha(1:n) = correct_ha_arr(1:n)
+      this%correct_hb(1:n) = correct_hb_arr(1:n)
+      this%crank(1:n) = crank_arr(1:n)
+      this%filo_placa(1:n) = filo_placa_arr(1:n)
+      this%depth_node(1:n) = depth_node_arr(1:n)
+
+   end subroutine gpu_upload_sgbc_constants
+
+   !--------------------------------------------------------------------------------
+   ! Upload per-node state arrays from host to device
+   ! Called once after InitSGBCs completes
+   !--------------------------------------------------------------------------------
+   subroutine gpu_upload_sgbc_state(this, E_arr, H_arr, E_past_arr, D_arr, &
+                                     G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, &
+                                     a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, &
+                                     capa_arr, delta_arr, &
+                                     tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, &
+                                     tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, &
+                                     Hyee_left_arr, Hyee_right_arr, &
+                                     offset)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(in) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:)
+      real(kind=rkind), intent(in) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:)
+      real(kind=rkind), intent(in) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:)
+      integer(kind=4), intent(in) :: capa_arr(:,:)
+      real(kind=rkind), intent(in) :: delta_arr(:,:)
+      real(kind=rkind), intent(in) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:)
+      real(kind=rkind), intent(in) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:)
+      real(kind=rkind), intent(in) :: Hyee_left_arr(:), Hyee_right_arr(:)
+      integer(kind=4), intent(in) :: offset  ! array offset (-depth)
+
+      integer(kind=4) :: n, sz, i, j
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      sz = 2*this%maxDepth + 1
+
+      do j = 1, n
+         do i = 1, sz
+            this%E_state(i, j) = E_arr(i + offset, j)
+            this%H_state(i, j) = H_arr(i + offset, j)
+            this%E_past_state(i, j) = E_past_arr(i + offset, j)
+            this%D_state(i, j) = D_arr(i + offset, j)
+            this%G1_int(i, j) = G1_int_arr(i + offset, j)
+            this%G2_int(i, j) = G2_int_arr(i + offset, j)
+            this%GM1_int(i, j) = GM1_int_arr(i + offset, j)
+            this%GM2_int(i, j) = GM2_int_arr(i + offset, j)
+            this%a_coef(i, j) = a_arr(i + offset, j)
+            this%b_coef(i, j) = b_arr(i + offset, j)
+            this%c_coef(i, j) = c_arr(i + offset, j)
+            this%rb_coef(i, j) = rb_arr(i + offset, j)
+            this%rh_coef(i, j) = rh_arr(i + offset, j)
+            this%rhm1_coef(i, j) = rhm1_arr(i + offset, j)
+            this%capa_state(i, j) = capa_arr(i + offset, j)
+            this%delta_state(i, j) = delta_arr(i + offset, j)
+         end do
+         this%tridiag_a1(j) = tridiag_a1_arr(j)
+         this%tridiag_b1(j) = tridiag_b1_arr(j)
+         this%tridiag_c1(j) = tridiag_c1_arr(j)
+         this%tridiag_an(j) = tridiag_an_arr(j)
+         this%tridiag_bn(j) = tridiag_bn_arr(j)
+         this%tridiag_cn(j) = tridiag_cn_arr(j)
+         this%Hyee_left(j) = Hyee_left_arr(j)
+         this%Hyee_right(j) = Hyee_right_arr(j)
+      end do
+
+      this%fields_on_device = .true.
+
+   end subroutine gpu_upload_sgbc_state
+
+   !--------------------------------------------------------------------------------
+   ! Upload per-timestep field values from host to device
+   ! Called before each SGBC E-advance
+   !--------------------------------------------------------------------------------
+   subroutine gpu_upload_sgbc_fields(this, Efield_arr, Ha_Plus_arr, Ha_Minu_arr, &
+                                      Hb_Plus_arr, Hb_Minu_arr)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(in) :: Efield_arr(:), Ha_Plus_arr(:), Ha_Minu_arr(:)
+      real(kind=rkind), intent(in) :: Hb_Plus_arr(:), Hb_Minu_arr(:)
+
+      integer(kind=4) :: n
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      this%Efield_val(1:n) = Efield_arr(1:n)
+      this%Ha_Plus_val(1:n) = Ha_Plus_arr(1:n)
+      this%Ha_Minu_val(1:n) = Ha_Minu_arr(1:n)
+      this%Hb_Plus_val(1:n) = Hb_Plus_arr(1:n)
+      this%Hb_Minu_val(1:n) = Hb_Minu_arr(1:n)
+
+   end subroutine gpu_upload_sgbc_fields
+
+   !--------------------------------------------------------------------------------
+   ! Download per-timestep field values from device to host
+   ! Called after each SGBC H-advance
+   !--------------------------------------------------------------------------------
+   subroutine gpu_download_sgbc_fields(this, Ha_Plus_arr, Ha_Minu_arr, &
+                                        Hb_Plus_arr, Hb_Minu_arr, &
+                                        Hyee_left_arr, Hyee_right_arr, &
+                                        Efield_arr)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(out) :: Ha_Plus_arr(:), Ha_Minu_arr(:)
+      real(kind=rkind), intent(out) :: Hb_Plus_arr(:), Hb_Minu_arr(:)
+      real(kind=rkind), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:)
+      real(kind=rkind), intent(out) :: Efield_arr(:)
+
+      integer(kind=4) :: n
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      Ha_Plus_arr(1:n) = this%Ha_Plus_val(1:n)
+      Ha_Minu_arr(1:n) = this%Ha_Minu_val(1:n)
+      Hb_Plus_arr(1:n) = this%Hb_Plus_val(1:n)
+      Hb_Minu_arr(1:n) = this%Hb_Minu_val(1:n)
+      Hyee_left_arr(1:n) = this%Hyee_left(1:n)
+      Hyee_right_arr(1:n) = this%Hyee_right(1:n)
+      Efield_arr(1:n) = this%Efield_val(1:n)
+
+   end subroutine gpu_download_sgbc_fields
+
+   !--------------------------------------------------------------------------------
+   ! Download per-node state from device to host
+   ! Called after each SGBC E-advance (for H-update) and at output times
+   !--------------------------------------------------------------------------------
+   subroutine gpu_download_sgbc_state(this, E_arr, H_arr, E_past_arr, D_arr, &
+                                       Hyee_left_arr, Hyee_right_arr, &
+                                       offset)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(out) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:)
+      real(kind=rkind), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:)
+      integer(kind=4), intent(in) :: offset  ! array offset (-depth)
+
+      integer(kind=4) :: n, sz, i, j
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      sz = 2*this%maxDepth + 1
+
+      do j = 1, n
+         do i = 1, sz
+            E_arr(i + offset, j) = this%E_state(i, j)
+            H_arr(i + offset, j) = this%H_state(i, j)
+            E_past_arr(i + offset, j) = this%E_past_state(i, j)
+            D_arr(i + offset, j) = this%D_state(i, j)
+         end do
+         Hyee_left_arr(j) = this%Hyee_left(j)
+         Hyee_right_arr(j) = this%Hyee_right(j)
+      end do
+
+      this%fields_on_device = .false.
+
+   end subroutine gpu_download_sgbc_state
+
+   !--------------------------------------------------------------------------------
+   ! Upload per-node coefficients from host to device
+   ! Called each timestep after calc_SGBCconstants
+   !--------------------------------------------------------------------------------
+   subroutine gpu_upload_sgbc_coeffs(this, G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, &
+                                      a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, &
+                                      tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, &
+                                      tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, &
+                                      offset)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(in) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:)
+      real(kind=rkind), intent(in) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:)
+      real(kind=rkind), intent(in) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:)
+      real(kind=rkind), intent(in) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:)
+      integer(kind=4), intent(in) :: offset
+
+      integer(kind=4) :: n, sz, i, j
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+      sz = 2*this%maxDepth + 1
+
+      do j = 1, n
+         do i = 1, sz
+            this%G1_int(i, j) = G1_int_arr(i + offset, j)
+            this%G2_int(i, j) = G2_int_arr(i + offset, j)
+            this%GM1_int(i, j) = GM1_int_arr(i + offset, j)
+            this%GM2_int(i, j) = GM2_int_arr(i + offset, j)
+            this%a_coef(i, j) = a_arr(i + offset, j)
+            this%b_coef(i, j) = b_arr(i + offset, j)
+            this%c_coef(i, j) = c_arr(i + offset, j)
+            this%rb_coef(i, j) = rb_arr(i + offset, j)
+            this%rh_coef(i, j) = rh_arr(i + offset, j)
+            this%rhm1_coef(i, j) = rhm1_arr(i + offset, j)
+         end do
+         this%tridiag_a1(j) = tridiag_a1_arr(j)
+         this%tridiag_b1(j) = tridiag_b1_arr(j)
+         this%tridiag_c1(j) = tridiag_c1_arr(j)
+         this%tridiag_an(j) = tridiag_an_arr(j)
+         this%tridiag_bn(j) = tridiag_bn_arr(j)
+         this%tridiag_cn(j) = tridiag_cn_arr(j)
+      end do
+
+   end subroutine gpu_upload_sgbc_coeffs
+
+   !--------------------------------------------------------------------------------
+   ! Destroy GPU SGBC state
+   !--------------------------------------------------------------------------------
+   subroutine gpu_destroy_sgbc(this)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+
+      if (.not. this%initialized) return
+
+      ! Deallocate device arrays
+      if (this%initialized) deallocate(this%d_node)
+      if (this%initialized) deallocate(this%gm2_ext)
+      if (this%initialized) deallocate(this%jmed_node)
+      if (this%initialized) deallocate(this%transE)
+      if (this%initialized) deallocate(this%transH)
+      if (this%initialized) deallocate(this%alignedH)
+      if (this%initialized) deallocate(this%g1_val)
+      if (this%initialized) deallocate(this%g2a_val)
+      if (this%initialized) deallocate(this%g2b_val)
+      if (this%initialized) deallocate(this%gm2_externo)
+      if (this%initialized) deallocate(this%correct_ha)
+      if (this%initialized) deallocate(this%correct_hb)
+      if (this%initialized) deallocate(this%crank)
+      if (this%initialized) deallocate(this%filo_placa)
+      if (this%initialized) deallocate(this%depth_node)
+
+      if (this%initialized) deallocate(this%E_state)
+      if (this%initialized) deallocate(this%H_state)
+      if (this%initialized) deallocate(this%E_past_state)
+      if (this%initialized) deallocate(this%D_state)
+      if (this%initialized) deallocate(this%G1_int)
+      if (this%initialized) deallocate(this%G2_int)
+      if (this%initialized) deallocate(this%GM1_int)
+      if (this%initialized) deallocate(this%GM2_int)
+      if (this%initialized) deallocate(this%a_coef)
+      if (this%initialized) deallocate(this%b_coef)
+      if (this%initialized) deallocate(this%c_coef)
+      if (this%initialized) deallocate(this%rb_coef)
+      if (this%initialized) deallocate(this%rh_coef)
+      if (this%initialized) deallocate(this%rhm1_coef)
+      if (this%initialized) deallocate(this%capa_state)
+      if (this%initialized) deallocate(this%delta_state)
+
+      if (this%initialized) deallocate(this%tridiag_a1)
+      if (this%initialized) deallocate(this%tridiag_b1)
+      if (this%initialized) deallocate(this%tridiag_c1)
+      if (this%initialized) deallocate(this%tridiag_an)
+      if (this%initialized) deallocate(this%tridiag_bn)
+      if (this%initialized) deallocate(this%tridiag_cn)
+
+      if (this%initialized) deallocate(this%Efield_val)
+      if (this%initialized) deallocate(this%Ha_Plus_val)
+      if (this%initialized) deallocate(this%Ha_Minu_val)
+      if (this%initialized) deallocate(this%Hb_Plus_val)
+      if (this%initialized) deallocate(this%Hb_Minu_val)
+      if (this%initialized) deallocate(this%Hyee_left)
+      if (this%initialized) deallocate(this%Hyee_right)
+
+      nullify(this%Ex); nullify(this%Ey); nullify(this%Ez)
+      nullify(this%Hx); nullify(this%Hy); nullify(this%Hz)
+      nullify(this%Idxe); nullify(this%Idye); nullify(this%Idze)
+      nullify(this%Idxh); nullify(this%Idyh); nullify(this%Idzh)
+
+      this%initialized = .false.
+      this%fields_on_device = .false.
+
+   end subroutine gpu_destroy_sgbc
+
+end module gpu_sgbc_core_m
diff --git a/src_main_pub/gpu_sgbc_e_m.F90 b/src_main_pub/gpu_sgbc_e_m.F90
new file mode 100644
index 000000000..3806a896d
--- /dev/null
+++ b/src_main_pub/gpu_sgbc_e_m.F90
@@ -0,0 +1,241 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU SGBC E-FIELD KERNELS - CUDA Fortran (CUF)
+!  Non-dispersive SGBC E-field advance (YEE + Crank-Nicolson)
+!  + Tridiagonal solver kernel (Thomas algorithm)
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_sgbc_e_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_sgbc_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Advance SGBC E-field - GPU accelerated
+   ! Each thread handles one node
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advance_sgbc_e(this, dt)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(in) :: dt
+
+      integer(kind=4) :: n
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+
+      call sgbc_e_kernel(this%E_state, this%H_state, this%E_past_state, this%D_state, &
+                         this%G1_int, this%G2_int, this%GM1_int, this%GM2_int, &
+                         this%a_coef, this%b_coef, this%c_coef, this%rb_coef, this%rh_coef, this%rhm1_coef, &
+                         this%tridiag_a1, this%tridiag_b1, this%tridiag_c1, &
+                         this%tridiag_an, this%tridiag_bn, this%tridiag_cn, &
+                         this%g1_val, this%g2a_val, this%g2b_val, &
+                         this%gm2_externo, this%transE, this%transH, this%alignedH, &
+                         this%correct_ha, this%correct_hb, this%crank, this%filo_placa, &
+                         this%depth_node, this%capa_state, this%delta_state, &
+                         this%Efield_val, this%Ha_Plus_val, this%Ha_Minu_val, &
+                         this%Hb_Plus_val, this%Hb_Minu_val, this%Hyee_left, this%Hyee_right, &
+                         n, dt)
+
+      this%fields_on_device = .true.
+
+   end subroutine gpu_advance_sgbc_e
+
+   !--------------------------------------------------------------------------------
+   ! Fused SGBC E-field advance kernel
+   ! Each thread handles one node
+   !--------------------------------------------------------------------------------
+   subroutine sgbc_e_kernel(E_state, H_state, E_past_state, D_state, &
+                            G1_int, G2_int, GM1_int, GM2_int, &
+                            a_coef, b_coef, c_coef, rb_coef, rh_coef, rhm1_coef, &
+                            tridiag_a1, tridiag_b1, tridiag_c1, &
+                            tridiag_an, tridiag_bn, tridiag_cn, &
+                            g1_val, g2a_val, g2b_val, &
+                            gm2_ext, transE, transH, alignedH, &
+                            correct_ha, correct_hb, crank, filo_placa, &
+                            depth_node, capa_state, delta_state, &
+                            Efield_val, Ha_Plus_val, Ha_Minu_val, &
+                            Hb_Plus_val, Hb_Minu_val, Hyee_left, Hyee_right, &
+                            n, dt)
+      integer(kind=4), intent(in) :: n
+      real(kind=rkind), intent(in) :: dt
+      real(kind=rkind), device, dimension(:,:) :: E_state, H_state, E_past_state, D_state
+      real(kind=rkind), device, dimension(:,:) :: G1_int, G2_int, GM1_int, GM2_int
+      real(kind=rkind), device, dimension(:,:) :: a_coef, b_coef, c_coef, rb_coef, rh_coef, rhm1_coef
+      real(kind=rkind), device, dimension(:) :: tridiag_a1, tridiag_b1, tridiag_c1
+      real(kind=rkind), device, dimension(:) :: tridiag_an, tridiag_bn, tridiag_cn
+      real(kind=rkind), device, dimension(:) :: g1_val, g2a_val, g2b_val, gm2_ext
+      real(kind=rkind), device, dimension(:) :: transE, transH, alignedH
+      logical, device, dimension(:) :: correct_ha, correct_hb, crank, filo_placa
+      integer(kind=4), device, dimension(:) :: depth_node
+      integer(kind=4), device, dimension(:,:) :: capa_state
+      real(kind=rkind), device, dimension(:,:) :: delta_state
+      real(kind=rkind), device, dimension(:) :: Efield_val, Ha_Plus_val, Ha_Minu_val
+      real(kind=rkind), device, dimension(:) :: Hb_Plus_val, Hb_Minu_val
+      real(kind=rkind), device, dimension(:) :: Hyee_left, Hyee_right
+
+      integer(kind=4) :: node, i, j, depth, sz
+      real(kind=rkind) :: g1, g2, g1i, g2i, gm1, gm2, delta_e
+      real(kind=rkind) :: Ha_Plus, Ha_Minu, Hb_Plus, Hb_Minu, Efield, Hyee_l, Hyee_r
+      logical :: ha, hb, cr, fp
+
+      !$cuf kernel do(1) <<<*, *>>>
+      do node = 1, n
+         depth = depth_node(node)
+         sz = 2*depth + 1
+         ha = correct_ha(node)
+         hb = correct_hb(node)
+         cr = crank(node)
+         fp = filo_placa(node)
+         gm2 = gm2_ext(node)
+         g1 = g1_val(node)
+         g2 = g2a_val(node)
+         Efield = Efield_val(node)
+         Ha_Plus = Ha_Plus_val(node)
+         Ha_Minu = Ha_Minu_val(node)
+         Hb_Plus = Hb_Plus_val(node)
+         Hb_Minu = Hb_Minu_val(node)
+         Hyee_l = Hyee_left(node)
+         Hyee_r = Hyee_right(node)
+
+         if (depth == 0) then
+            ! depth=0 case: simple boundary update
+            if (ha) then
+               E_state(1, node) = g1 * E_state(1, node) + &
+                  g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu)
+            elseif (hb) then
+               E_state(1, node) = g1 * E_state(1, node) + &
+                  g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu)
+            else
+               E_state(1, node) = g1 * E_state(1, node) + &
+                  g2 * (Ha_Plus - Ha_Minu) - g2b_val(node) * (Hb_Plus - Hb_Minu)
+            end if
+         else
+            ! depth>0 case: boundary + interior
+            ! Boundary cells: E(depth) and E(-depth)
+            if (.not. cr) then
+               ! YEE mode
+               if (ha) then
+                  E_state(depth + 1, node) = g1_val(1) * E_state(depth + 1, node) + &
+                     (g2a_val(1) * (Ha_Plus - Hyee_r) - g2b_val(1) * (Hb_Plus - Hb_Minu))
+                  E_state(1, node) = g1_val(0) * E_state(1, node) + &
+                     (g2a_val(0) * (Hyee_l - Ha_Minu) - g2b_val(0) * (Hb_Plus - Hb_Minu))
+               elseif (hb) then
+                  E_state(depth + 1, node) = g1_val(1) * E_state(depth + 1, node) + &
+                     (g2a_val(1) * (Ha_Plus - Ha_Minu) - g2b_val(1) * (Hb_Plus - Hyee_r))
+                  E_state(1, node) = g1_val(0) * E_state(1, node) + &
+                     (g2a_val(0) * (Ha_Plus - Ha_Minu) - g2b_val(0) * (Hyee_l - Hb_Minu))
+               end if
+
+               ! Interior cells
+               do i = 2, depth
+                  g1i = G1_int(i, node)
+                  g2i = G2_int(i, node)
+                  delta_e = 0.5_rkind * (delta_state(i, node) + delta_state(i-1, node))
+                  E_state(i, node) = g1i * E_state(i, node) + g2i / delta_e * &
+                     (H_state(i, node) - H_state(i-1, node))
+               end do
+            else
+               ! Crank-Nicolson mode
+               ! Copy E to E_past
+               do i = 1, sz
+                  E_past_state(i, node) = E_state(i, node)
+               end do
+
+               ! Compute D vector for boundaries
+               if (ha) then
+                  i = depth + 1
+                  D_state(i, node) = -tridiag_an(node) * E_past_state(i-1, node) + &
+                     tridiag_bn(node) * E_past_state(i, node) + &
+                     g2a_val(1) * (Ha_Plus - Hyee_r) - g2b_val(1) * (Hb_Plus - Hb_Minu)
+                  i = 1
+                  D_state(i, node) = -tridiag_c1(node) * E_past_state(i+1, node) + &
+                     tridiag_bn(node) * E_past_state(i, node) + &
+                     g2a_val(0) * (Hyee_l - Ha_Minu) - g2b_val(0) * (Hb_Plus - Hb_Minu)
+               elseif (hb) then
+                  i = depth + 1
+                  D_state(i, node) = -tridiag_an(node) * E_past_state(i-1, node) + &
+                     tridiag_bn(node) * E_past_state(i, node) + &
+                     g2a_val(1) * (Ha_Plus - Ha_Minu) - g2b_val(1) * (Hb_Plus - Hyee_r)
+                  i = 1
+                  D_state(i, node) = -tridiag_c1(node) * E_past_state(i+1, node) + &
+                     tridiag_bn(node) * E_past_state(i, node) + &
+                     g2a_val(0) * (Ha_Plus - Ha_Minu) - g2b_val(0) * (Hyee_l - Hb_Minu)
+               end if
+
+               ! Compute D vector for interior cells
+               do i = 2, depth
+                  D_state(i, node) = -a_coef(i, node) * E_past_state(i-1, node) - &
+                     c_coef(i, node) * E_past_state(i+1, node) + &
+                     rb_coef(i, node) * E_past_state(i, node) + &
+                     rh_coef(i, node) * H_state(i, node) - &
+                     rhm1_coef(i, node) * H_state(i-1, node)
+               end do
+
+               ! Solve tridiagonal system
+               call sgbc_solve_tridiag(D_state, E_state, a_coef, b_coef, c_coef, &
+                                       tridiag_a1(node), tridiag_b1(node), tridiag_c1(node), &
+                                       tridiag_an(node), tridiag_bn(node), tridiag_cn(node), &
+                                       sz, node)
+            end if
+         end if
+
+         ! Update Efield = average of boundary cells
+         Efield_val(node) = 0.5_rkind * (E_state(1, node) + E_state(depth + 1, node))
+      end do
+
+   end subroutine sgbc_e_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Tridiagonal solver (Thomas algorithm) - GPU kernel
+   ! Each thread solves one node's system
+   !--------------------------------------------------------------------------------
+   subroutine sgbc_solve_tridiag(d, x, a, b, c, a1, b1, c1, an, bn, cn, n, node)
+      integer(kind=4), intent(in) :: n, node
+      real(kind=rkind), device, dimension(:,:), intent(inout) :: d, x
+      real(kind=rkind), device, dimension(:,:), intent(in) :: a, b, c
+      real(kind=rkind), intent(in) :: a1, b1, c1, an, bn, cn
+
+      real(kind=rkind), dimension(:), allocatable, device :: aa, bb, cc
+      real(kind=rkind), dimension(:), allocatable, device :: cp, dp
+      real(kind=rkind) :: m
+      integer(kind=4) :: i
+
+      allocate(aa(n), bb(n), cc(n))
+      allocate(cp(n), dp(n))
+
+      aa(1) = a1
+      bb(1) = b1
+      cc(1) = c1
+      aa(n) = an
+      bb(n) = bn
+      cc(n) = cn
+      aa(2:n-1) = a(2:n-1, node)
+      bb(2:n-1) = b(2:n-1, node)
+      cc(2:n-1) = c(2:n-1, node)
+
+      ! Forward elimination
+      cp(1) = cc(1) / bb(1)
+      dp(1) = d(1, node) / bb(1)
+      do i = 2, n
+         m = bb(i) - cp(i-1) * aa(i)
+         cp(i) = cc(i) / m
+         dp(i) = (d(i, node) - dp(i-1) * aa(i)) / m
+      end do
+
+      ! Back substitution
+      x(n, node) = dp(n)
+      do i = n-1, 1, -1
+         x(i, node) = dp(i) - cp(i) * x(i+1, node)
+      end do
+
+      deallocate(aa, bb, cc, cp, dp)
+
+   end subroutine sgbc_solve_tridiag
+
+end module gpu_sgbc_e_m
diff --git a/src_main_pub/gpu_sgbc_h_m.F90 b/src_main_pub/gpu_sgbc_h_m.F90
new file mode 100644
index 000000000..e5838b37d
--- /dev/null
+++ b/src_main_pub/gpu_sgbc_h_m.F90
@@ -0,0 +1,134 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU SGBC H-FIELD KERNELS - CUDA Fortran (CUF)
+!  Non-dispersive SGBC H-field advance + boundary correction
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_sgbc_h_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_sgbc_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Advance SGBC H-field - GPU accelerated
+   ! Each thread handles one node
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advance_sgbc_h(this, dt)
+      class(gpu_state_sgbc_t), intent(inout) :: this
+      real(kind=rkind), intent(in) :: dt
+
+      integer(kind=4) :: n
+
+      if (.not. this%initialized) return
+
+      n = this%numNodes
+
+      call sgbc_h_kernel(this%H_state, this%E_state, this%E_past_state, &
+                         this%GM1_int, this%GM2_int, &
+                         this%crank, this%depth_node, &
+                         this%delta_state, &
+                         this%Ha_Plus_val, this%Ha_Minu_val, &
+                         this%Hb_Plus_val, this%Hb_Minu_val, &
+                         this%Efield_val, this%Hyee_left, this%Hyee_right, &
+                         this%gm2_ext, this%correct_ha, this%correct_hb, &
+                         n, dt)
+
+   end subroutine gpu_advance_sgbc_h
+
+   !--------------------------------------------------------------------------------
+   ! Fused SGBC H-field advance + boundary correction kernel
+   ! Each thread handles one node
+   !--------------------------------------------------------------------------------
+   subroutine sgbc_h_kernel(H_state, E_state, E_past_state, &
+                            GM1_int, GM2_int, &
+                            crank, depth_node, &
+                            delta_state, &
+                            Ha_Plus_val, Ha_Minu_val, &
+                            Hb_Plus_val, Hb_Minu_val, &
+                            Efield_val, Hyee_left, Hyee_right, &
+                            gm2_ext, correct_ha, correct_hb, &
+                            n, dt)
+      integer(kind=4), intent(in) :: n
+      real(kind=rkind), intent(in) :: dt
+      real(kind=rkind), device, dimension(:,:) :: H_state, E_state, E_past_state
+      real(kind=rkind), device, dimension(:,:) :: GM1_int, GM2_int
+      integer(kind=4), device, dimension(:,:) :: delta_state
+      logical, device, dimension(:) :: crank, correct_ha, correct_hb
+      integer(kind=4), device, dimension(:) :: depth_node
+      real(kind=rkind), device, dimension(:) :: Ha_Plus_val, Ha_Minu_val
+      real(kind=rkind), device, dimension(:) :: Hb_Plus_val, Hb_Minu_val
+      real(kind=rkind), device, dimension(:) :: Efield_val, Hyee_left, Hyee_right
+      real(kind=rkind), device, dimension(:) :: gm2_ext
+
+      integer(kind=4) :: node, i, depth, sz
+      real(kind=rkind) :: gm1, gm2, delta_e
+      real(kind=rkind) :: Ha_Plus, Ha_Minu, Hb_Plus, Hb_Minu, Efield
+      real(kind=rkind) :: E_i1, E_i, E_past_i1, E_past_i
+      logical :: cr, ha, hb
+
+      !$cuf kernel do(1) <<<*, *>>>
+      do node = 1, n
+         depth = depth_node(node)
+         sz = 2*depth + 1
+         cr = crank(node)
+         ha = correct_ha(node)
+         hb = correct_hb(node)
+
+         if (depth > 0) then
+            ! Update internal H cells
+            if (cr) then
+               ! Crank-Nicolson: half-step advance
+               do i = 1, depth
+                  gm1 = GM1_int(i, node)
+                  gm2 = GM2_int(i, node)
+                  E_i1 = E_state(i+1, node)
+                  E_i = E_state(i, node)
+                  E_past_i1 = E_past_state(i+1, node)
+                  E_past_i = E_past_state(i, node)
+                  H_state(i, node) = gm1 * H_state(i, node) + &
+                     0.5_rkind * gm2 * (E_i1 - E_i + E_past_i1 - E_past_i)
+               end do
+               ! Update Hyee values
+               H_state(1, node) = Hyee_left(node)
+               H_state(depth, node) = Hyee_right(node)
+            else
+               ! YEE mode
+               do i = 1, depth
+                  gm1 = GM1_int(i, node)
+                  gm2 = GM2_int(i, node)
+                  E_i1 = E_state(i+1, node)
+                  E_i = E_state(i, node)
+                  H_state(i, node) = gm1 * H_state(i, node) + gm2 * (E_i1 - E_i)
+               end do
+               Hyee_left(node) = H_state(1, node)
+               Hyee_right(node) = H_state(depth, node)
+            end if
+         end if
+
+         ! Boundary correction (AdvanceSGBCH)
+         Ha_Plus = Ha_Plus_val(node)
+         Ha_Minu = Ha_Minu_val(node)
+         Hb_Plus = Hb_Plus_val(node)
+         Hb_Minu = Hb_Minu_val(node)
+         Efield = Efield_val(node)
+
+         if (ha) then
+            Ha_Plus_val(node) = Ha_Plus + gm2_ext(node) * (Efield - E_state(depth + 1, node))
+            Ha_Minu_val(node) = Ha_Minu - gm2_ext(node) * (Efield - E_state(1, node))
+         elseif (hb) then
+            Hb_Plus_val(node) = Hb_Plus - gm2_ext(node) * (Efield - E_state(depth + 1, node))
+            Hb_Minu_val(node) = Hb_Minu + gm2_ext(node) * (Efield - E_state(1, node))
+         end if
+
+         ! Update Efield to average
+         Efield_val(node) = 0.5_rkind * (E_state(1, node) + E_state(depth + 1, node))
+      end do
+
+   end subroutine sgbc_h_kernel
+
+end module gpu_sgbc_h_m
diff --git a/src_main_pub/gpu_yee_m.F90 b/src_main_pub/gpu_yee_m.F90
new file mode 100644
index 000000000..2654ead01
--- /dev/null
+++ b/src_main_pub/gpu_yee_m.F90
@@ -0,0 +1,398 @@
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+!  GPU YEE KERNELS MODULE - CUDA Fortran (CUF) accelerated YEE kernels
+!  Fields stay on device between timesteps. Only kernel wrappers here.
+!  Split from gpu_kernels_cuf.F90 to avoid NVHPC compiler file-size limit.
+!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+
+module gpu_yee_m
+
+   use FDETYPES_m
+   use Report_m
+   use cudafor
+   use gpu_core_m
+
+   implicit none
+
+contains
+
+   !--------------------------------------------------------------------------------
+   ! Advance Ex field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceEx(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceEx_kernel(this%Ex_d, this%Hy_d, this%Hz_d, this%sggMiEx_d, &
+                                this%Idyh_d, this%Idzh_d, this%g1_d, this%g2_d, &
+                                bounds%sweepEx%NX, bounds%sweepEx%NY, bounds%sweepEx%NZ)
+
+   end subroutine gpu_advanceEx
+
+   subroutine gpu_advanceEx_kernel(Ex_d, Hy_d, Hz_d, sggMiEx_d, Idyh_d, Idzh_d, g1_d, g2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Hy_d, Hz_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d
+      real(kind=rkind), device, dimension(:) :: Idyh_d, Idzh_d, g1_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idzhk, Idyhj
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idzhk = Idzh_d(k)
+               Idyhj = Idyh_d(j)
+               medio = sggMiEx_d(i,j,k)
+               Ex_d(i,j,k) = g1_d(medio)*Ex_d(i,j,k) + g2_d(medio)* &
+                  ((Hz_d(i,j,k) - Hz_d(i,j-1,k))*Idyhj - (Hy_d(i,j,k) - Hy_d(i,j,k-1))*Idzhk)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceEx_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Advance Ey field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceEy(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceEy_kernel(this%Ey_d, this%Hz_d, this%Hx_d, this%sggMiEy_d, &
+                                this%Idzh_d, this%Idxh_d, this%g1_d, this%g2_d, &
+                                bounds%sweepEy%NX, bounds%sweepEy%NY, bounds%sweepEy%NZ)
+
+   end subroutine gpu_advanceEy
+
+   subroutine gpu_advanceEy_kernel(Ey_d, Hz_d, Hx_d, sggMiEy_d, Idzh_d, Idxh_d, g1_d, g2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Ey_d, Hz_d, Hx_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEy_d
+      real(kind=rkind), device, dimension(:) :: Idzh_d, Idxh_d, g1_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idzhk
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idzhk = Idzh_d(k)
+               medio = sggMiEy_d(i,j,k)
+               Ey_d(i,j,k) = g1_d(medio)*Ey_d(i,j,k) + g2_d(medio)* &
+                  ((Hx_d(i,j,k)-Hx_d(i,j,k-1))*Idzhk - (Hz_d(i,j,k)-Hz_d(i-1,j,k))*Idxh_d(i))
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceEy_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Advance Ez field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceEz(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceEz_kernel(this%Ez_d, this%Hx_d, this%Hy_d, this%sggMiEz_d, &
+                                this%Idyh_d, this%Idxh_d, this%g1_d, this%g2_d, &
+                                bounds%sweepEz%NX, bounds%sweepEz%NY, bounds%sweepEz%NZ)
+
+   end subroutine gpu_advanceEz
+
+   subroutine gpu_advanceEz_kernel(Ez_d, Hx_d, Hy_d, sggMiEz_d, Idyh_d, Idxh_d, g1_d, g2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Ez_d, Hx_d, Hy_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEz_d
+      real(kind=rkind), device, dimension(:) :: Idyh_d, Idxh_d, g1_d, g2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idyhj
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idyhj = Idyh_d(j)
+               medio = sggMiEz_d(i,j,k)
+               Ez_d(i,j,k) = g1_d(medio)*Ez_d(i,j,k) + g2_d(medio)* &
+                  ((Hy_d(i,j,k)-Hy_d(i-1,j,k))*Idxh_d(i) - (Hx_d(i,j,k)-Hx_d(i,j-1,k))*Idyhj)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceEz_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Advance Hx field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceHx(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceHx_kernel(this%Hx_d, this%Ey_d, this%Ez_d, this%sggMiHx_d, &
+                                this%Idye_d, this%Idze_d, this%gm1_d, this%gm2_d, &
+                                bounds%sweepHx%NX, bounds%sweepHx%NY, bounds%sweepHx%NZ)
+
+   end subroutine gpu_advanceHx
+
+   subroutine gpu_advanceHx_kernel(Hx_d, Ey_d, Ez_d, sggMiHx_d, Idye_d, Idze_d, gm1_d, gm2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Ey_d, Ez_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d
+      real(kind=rkind), device, dimension(:) :: Idye_d, Idze_d, gm1_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idzek, Idyej
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idzek = Idze_d(k)
+               Idyej = Idye_d(j)
+               medio = sggMiHx_d(i,j,k)
+               Hx_d(i,j,k) = gm1_d(medio)*Hx_d(i,j,k) + gm2_d(medio)* &
+                  ((Ey_d(i,j,k+1)-Ey_d(i,j,k))*Idzek - (Ez_d(i,j+1,k)-Ez_d(i,j,k))*Idyej)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceHx_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Advance Hy field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceHy(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceHy_kernel(this%Hy_d, this%Ez_d, this%Ex_d, this%sggMiHy_d, &
+                                this%Idze_d, this%Idxe_d, this%gm1_d, this%gm2_d, &
+                                bounds%sweepHy%NX, bounds%sweepHy%NY, bounds%sweepHy%NZ)
+
+   end subroutine gpu_advanceHy
+
+   subroutine gpu_advanceHy_kernel(Hy_d, Ez_d, Ex_d, sggMiHy_d, Idze_d, Idxe_d, gm1_d, gm2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Hy_d, Ez_d, Ex_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHy_d
+      real(kind=rkind), device, dimension(:) :: Idze_d, Idxe_d, gm1_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idzek
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idzek = Idze_d(k)
+               medio = sggMiHy_d(i,j,k)
+               Hy_d(i,j,k) = gm1_d(medio)*Hy_d(i,j,k) + gm2_d(medio)* &
+                  ((Ez_d(i+1,j,k)-Ez_d(i,j,k))*Idxe_d(i) - (Ex_d(i,j,k+1)-Ex_d(i,j,k))*Idzek)
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceHy_kernel
+
+   !--------------------------------------------------------------------------------
+   ! Advance Hz field - GPU accelerated YEE kernel (fields on device, no copy)
+   !--------------------------------------------------------------------------------
+   subroutine gpu_advanceHz(this, bounds)
+      class(gpu_state_t), intent(inout) :: this
+      type(bounds_t), intent(in) :: bounds
+
+      if (.not. this%initialized) return
+
+      call gpu_advanceHz_kernel(this%Hz_d, this%Ex_d, this%Ey_d, this%sggMiHz_d, &
+                                this%Idye_d, this%Idxe_d, this%gm1_d, this%gm2_d, &
+                                bounds%sweepHz%NX, bounds%sweepHz%NY, bounds%sweepHz%NZ)
+
+   end subroutine gpu_advanceHz
+
+   subroutine gpu_advanceHz_kernel(Hz_d, Ex_d, Ey_d, sggMiHz_d, Idye_d, Idxe_d, gm1_d, gm2_d, nx, ny, nz)
+      integer(kind=4), intent(in) :: nx, ny, nz
+      real(kind=rkind), device, dimension(:,:,:) :: Hz_d, Ex_d, Ey_d
+      integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHz_d
+      real(kind=rkind), device, dimension(:) :: Idye_d, Idxe_d, gm1_d, gm2_d
+
+      integer(kind=4) :: i, j, k
+      real(kind=rkind) :: Idyej
+      integer(kind=integersizeofmediamatrices) :: medio
+
+      !$cuf kernel do(3) <<<*, *>>>
+      do k=1,nz
+         do j=1,ny
+            do i=1,nx
+               Idyej = Idye_d(j)
+               medio = sggMiHz_d(i,j,k)
+               Hz_d(i,j,k) = gm1_d(medio)*Hz_d(i,j,k) + gm2_d(medio)* &
+                  ((Ex_d(i,j+1,k)-Ex_d(i,j,k))*Idyej - (Ey_d(i+1,j,k)-Ey_d(i,j,k))*Idxe_d(i))
+            end do
+         end do
+      end do
+   end subroutine gpu_advanceHz_kernel
+
+   !--------------------------------------------------------------------------------
+    ! Fused E-field YEE kernel — updates Ex, Ey, Ez in single kernel launch
+    !--------------------------------------------------------------------------------
+    subroutine gpu_advanceE_all(this, bounds)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: bounds
+
+       if (.not. this%initialized) return
+
+       call gpu_advanceE_all_kernel(this%Ex_d, this%Ey_d, this%Ez_d, &
+                                     this%Hx_d, this%Hy_d, this%Hz_d, &
+                                     this%sggMiEx_d, this%sggMiEy_d, this%sggMiEz_d, &
+                                     this%Idxh_d, this%Idxe_d, &
+                                     this%Idyh_d, this%Idzh_d, this%Idze_d, this%Idye_d, &
+                                     this%g1_d, this%g2_d, &
+                                     bounds%sweepEx%NX, bounds%sweepEx%NY, bounds%sweepEx%NZ, &
+                                     bounds%sweepEy%NX, bounds%sweepEy%NY, bounds%sweepEy%NZ, &
+                                     bounds%sweepEz%NX, bounds%sweepEz%NY, bounds%sweepEz%NZ)
+
+    end subroutine gpu_advanceE_all
+
+    subroutine gpu_advanceE_all_kernel(Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d, &
+                                       sggMiEx_d, sggMiEy_d, sggMiEz_d, &
+                                       Idxh_d, Idxe_d, &
+                                       Idyh_d, Idzh_d, Idze_d, Idye_d, &
+                                       g1_d, g2_d, &
+                                       Ex_nx, Ex_ny, Ex_nz, &
+                                       Ey_nx, Ey_ny, Ey_nz, &
+                                       Ez_nx, Ez_ny, Ez_nz)
+       integer(kind=4), intent(in) :: Ex_nx, Ex_ny, Ex_nz, Ey_nx, Ey_ny, Ey_nz, Ez_nx, Ez_ny, Ez_nz
+       real(kind=rkind), device, dimension(:,:,:) :: Ex_d, Ey_d, Ez_d, Hx_d, Hy_d, Hz_d
+       integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiEx_d, sggMiEy_d, sggMiEz_d
+       real(kind=rkind), device, dimension(:) :: Idxh_d, Idxe_d, Idyh_d, Idzh_d, Idze_d, Idye_d, g1_d, g2_d
+
+       integer(kind=4) :: i, j, k
+       real(kind=rkind) :: Idzhk, Idyhj, Idzhk2, Idxh, Idzek, Idyej, Idzek2, Idyej2
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Ex_nz
+          do j=1,Ex_ny
+             do i=1,Ex_nx
+                Idzhk = Idzh_d(k)
+                Idyhj = Idyh_d(j)
+                Ex_d(i,j,k) = g1_d(sggMiEx_d(i,j,k))*Ex_d(i,j,k) + g2_d(sggMiEx_d(i,j,k))* &
+                   ((Hz_d(i,j,k) - Hz_d(i,j-1,k))*Idyhj - (Hy_d(i,j,k) - Hy_d(i,j,k-1))*Idzhk)
+             end do
+          end do
+       end do
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Ey_nz
+          do j=1,Ey_ny
+             do i=1,Ey_nx
+                Idzhk2 = Idzh_d(k)
+                Idxh = Idxh_d(i)
+                Ey_d(i,j,k) = g1_d(sggMiEy_d(i,j,k))*Ey_d(i,j,k) + g2_d(sggMiEy_d(i,j,k))* &
+                   ((Hx_d(i,j,k)-Hx_d(i,j,k-1))*Idzhk2 - (Hz_d(i,j,k)-Hz_d(i-1,j,k))*Idxh)
+             end do
+          end do
+       end do
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Ez_nz
+          do j=1,Ez_ny
+             do i=1,Ez_nx
+                Idyej2 = Idye_d(j)
+                Idzek2 = Idze_d(k)
+                Ez_d(i,j,k) = g1_d(sggMiEz_d(i,j,k))*Ez_d(i,j,k) + g2_d(sggMiEz_d(i,j,k))* &
+                   ((Hy_d(i,j,k)-Hy_d(i-1,j,k))*Idxh_d(i) - (Hx_d(i,j,k)-Hx_d(i,j-1,k))*Idyej2)
+             end do
+          end do
+       end do
+
+    end subroutine gpu_advanceE_all_kernel
+
+    !--------------------------------------------------------------------------------
+    ! Fused H-field YEE kernel — updates Hx, Hy, Hz in single kernel launch
+    !--------------------------------------------------------------------------------
+    subroutine gpu_advanceH_all(this, bounds)
+       class(gpu_state_t), intent(inout) :: this
+       type(bounds_t), intent(in) :: bounds
+
+       if (.not. this%initialized) return
+
+      call gpu_advanceH_all_kernel(this%Hx_d, this%Hy_d, this%Hz_d, &
+                                     this%Ey_d, this%Ez_d, this%Ex_d, &
+                                     this%sggMiHx_d, this%sggMiHy_d, this%sggMiHz_d, &
+                                     this%Idye_d, this%Idze_d, &
+                                     this%Idyh_d, this%Idzh_d, this%Idxh_d, this%Idxe_d, &
+                                     this%gm1_d, this%gm2_d, &
+                                     bounds%sweepHx%NX, bounds%sweepHx%NY, bounds%sweepHx%NZ, &
+                                     bounds%sweepHy%NX, bounds%sweepHy%NY, bounds%sweepHy%NZ, &
+                                     bounds%sweepHz%NX, bounds%sweepHz%NY, bounds%sweepHz%NZ)
+
+    end subroutine gpu_advanceH_all
+
+    subroutine gpu_advanceH_all_kernel(Hx_d, Hy_d, Hz_d, Ex_d, Ey_d, Ez_d, &
+                                       sggMiHx_d, sggMiHy_d, sggMiHz_d, &
+                                       Idye_d, Idze_d, &
+                                       Idyh_d, Idzh_d, Idxh_d, Idxe_d, &
+                                       gm1_d, gm2_d, &
+                                       Hx_nx, Hx_ny, Hx_nz, &
+                                       Hy_nx, Hy_ny, Hy_nz, &
+                                       Hz_nx, Hz_ny, Hz_nz)
+       integer(kind=4), intent(in) :: Hx_nx, Hx_ny, Hx_nz, Hy_nx, Hy_ny, Hy_nz, Hz_nx, Hz_ny, Hz_nz
+       real(kind=rkind), device, dimension(:,:,:) :: Hx_d, Hy_d, Hz_d, Ex_d, Ey_d, Ez_d
+       integer(kind=integersizeofmediamatrices), device, dimension(:,:,:) :: sggMiHx_d, sggMiHy_d, sggMiHz_d
+       real(kind=rkind), device, dimension(:) :: Idye_d, Idze_d, Idyh_d, Idzh_d, Idxh_d, Idxe_d, gm1_d, gm2_d
+
+       integer(kind=4) :: i, j, k
+       real(kind=rkind) :: Idzek, Idyej, Idzek2, Idxh, Idyej2, Idxh2
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Hx_nz
+          do j=1,Hx_ny
+             do i=1,Hx_nx
+                Idzek = Idze_d(k)
+                Idyej = Idye_d(j)
+                Hx_d(i,j,k) = gm1_d(sggMiHx_d(i,j,k))*Hx_d(i,j,k) + gm2_d(sggMiHx_d(i,j,k))* &
+                   ((Ey_d(i,j,k+1)-Ey_d(i,j,k))*Idzek - (Ez_d(i,j+1,k)-Ez_d(i,j,k))*Idyej)
+             end do
+          end do
+       end do
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Hy_nz
+          do j=1,Hy_ny
+             do i=1,Hy_nx
+                Idzek2 = Idze_d(k)
+                Idxh = Idxh_d(i)
+                Hy_d(i,j,k) = gm1_d(sggMiHy_d(i,j,k))*Hy_d(i,j,k) + gm2_d(sggMiHy_d(i,j,k))* &
+                   ((Ez_d(i+1,j,k)-Ez_d(i,j,k))*Idxh - (Ex_d(i,j,k+1)-Ex_d(i,j,k))*Idzek2)
+             end do
+          end do
+       end do
+
+       !$cuf kernel do(3) <<<*, *>>>
+       do k=1,Hz_nz
+          do j=1,Hz_ny
+             do i=1,Hz_nx
+                Idyej2 = Idye_d(j)
+                Idxh2 = Idxh_d(i)
+                Hz_d(i,j,k) = gm1_d(sggMiHz_d(i,j,k))*Hz_d(i,j,k) + gm2_d(sggMiHz_d(i,j,k))* &
+                   ((Ex_d(i,j+1,k)-Ex_d(i,j,k))*Idyej2 - (Ey_d(i+1,j,k)-Ey_d(i,j,k))*Idxh2)
+             end do
+          end do
+       end do
+
+    end subroutine gpu_advanceH_all_kernel
+
+end module gpu_yee_m
diff --git a/src_main_pub/maloney_nostoch.F90 b/src_main_pub/maloney_nostoch.F90
index 10226019d..fa616aa52 100755
--- a/src_main_pub/maloney_nostoch.F90
+++ b/src_main_pub/maloney_nostoch.F90
@@ -87,9 +87,151 @@ module SGBC_nostoch_m
 !!!
 public Malon_t,SGBCSurface_t !el tipo es publico
 public AdvanceSGBCE,AdvanceSGBCH,InitSGBCs,DestroySGBCs,StoreFieldsSGBCs,calc_SGBCconstants,GetSGBCs
+public GetSGBCConstants,GetSGBCState,GetSGBCFieldPointers
+public malon
 
 contains
 
+!--------------------------------------------------------------------------------
+! Export per-node constants for GPU
+!--------------------------------------------------------------------------------
+subroutine GetSGBCConstants(depth_arr, gm2_ext_arr, jmed_arr, transE_arr, transH_arr, alignedH_arr, &
+                             g1_arr, g2a_arr, g2b_arr, gm2_ext_val_arr, &
+                             correct_ha_arr, correct_hb_arr, crank_arr, filo_placa_arr, &
+                             depth_node_arr, G1_int_arr, G2_int_arr, GM1_int_arr, GM2_int_arr, &
+                             a_arr, b_arr, c_arr, rb_arr, rh_arr, rhm1_arr, &
+                             tridiag_a1_arr, tridiag_b1_arr, tridiag_c1_arr, &
+                             tridiag_an_arr, tridiag_bn_arr, tridiag_cn_arr, &
+                             capa_arr, delta_arr, Hyee_left_arr, Hyee_right_arr, &
+                             numNodes_out, maxDepth_out)
+   integer(kind=4), intent(out) :: depth_arr(:), jmed_arr(:), depth_node_arr(:)
+   real(kind=RKIND), intent(out) :: gm2_ext_arr(:), transE_arr(:), transH_arr(:), alignedH_arr(:)
+   real(kind=RKIND), intent(out) :: g1_arr(:), g2a_arr(:), g2b_arr(:), gm2_ext_val_arr(:)
+   logical, intent(out) :: correct_ha_arr(:), correct_hb_arr(:), crank_arr(:), filo_placa_arr(:)
+   real(kind=RKIND), intent(out) :: G1_int_arr(:,:), G2_int_arr(:,:), GM1_int_arr(:,:), GM2_int_arr(:,:)
+   real(kind=RKIND), intent(out) :: a_arr(:,:), b_arr(:,:), c_arr(:,:), rb_arr(:,:), rh_arr(:,:), rhm1_arr(:,:)
+   real(kind=RKIND), intent(out) :: tridiag_a1_arr(:), tridiag_b1_arr(:), tridiag_c1_arr(:)
+   real(kind=RKIND), intent(out) :: tridiag_an_arr(:), tridiag_bn_arr(:), tridiag_cn_arr(:)
+   integer(kind=4), intent(out) :: capa_arr(:,:)
+   real(kind=RKIND), intent(out) :: delta_arr(:,:)
+   real(kind=RKIND), intent(out) :: Hyee_left_arr(:), Hyee_right_arr(:)
+   integer(kind=4), intent(out) :: numNodes_out, maxDepth_out
+
+   type(SGBCSurface_t), pointer :: compo
+   integer(kind=4) :: conta, i, maxD
+
+   maxD = 0
+   do conta = 1, malon%NumNodes
+      compo => malon%Nodes(conta)
+      if (compo%depth > maxD) maxD = compo%depth
+   end do
+   maxD = max(maxD, 1)
+
+   numNodes_out = malon%NumNodes
+   maxDepth_out = maxD
+
+   do conta = 1, malon%NumNodes
+      compo => malon%Nodes(conta)
+      depth_arr(conta) = 1
+      jmed_arr(conta) = compo%jmed
+      depth_node_arr(conta) = compo%depth
+      gm2_ext_arr(conta) = compo%GM2_externo
+      transE_arr(conta) = compo%transversalDeltaE
+      transH_arr(conta) = compo%transversalDeltaH
+      alignedH_arr(conta) = compo%alignedlDeltaH
+      g1_arr(conta) = compo%g1(0)
+      g2a_arr(conta) = compo%g2a(0)
+      g2b_arr(conta) = compo%g2b(0)
+      gm2_ext_val_arr(conta) = compo%GM2_externo
+      correct_ha_arr(conta) = compo%correct_ha
+      correct_hb_arr(conta) = compo%correct_hb
+      crank_arr(conta) = compo%SGBCCrank
+      filo_placa_arr(conta) = compo%es_unfilo_placa
+
+      ! Copy per-node state arrays
+      do i = -compo%depth, compo%depth
+         G1_int_arr(i + compo%depth + 1, conta) = compo%G1_interno(i)
+         G2_int_arr(i + compo%depth + 1, conta) = compo%G2_interno(i)
+         GM1_int_arr(i + compo%depth + 1, conta) = compo%GM1_interno(i)
+         GM2_int_arr(i + compo%depth + 1, conta) = compo%GM2_interno(i)
+         a_arr(i + compo%depth + 1, conta) = compo%a(i)
+         b_arr(i + compo%depth + 1, conta) = compo%b(i)
+         c_arr(i + compo%depth + 1, conta) = compo%c(i)
+         rb_arr(i + compo%depth + 1, conta) = compo%rb(i)
+         rh_arr(i + compo%depth + 1, conta) = compo%rh(i)
+         rhm1_arr(i + compo%depth + 1, conta) = compo%rhm1(i)
+         capa_arr(i + compo%depth + 1, conta) = compo%capa(i)
+         delta_arr(i + compo%depth + 1, conta) = compo%delta_entreEinterno(i)
+      end do
+
+      ! Tridiag boundary constants
+      tridiag_a1_arr(conta) = compo%a1
+      tridiag_b1_arr(conta) = compo%b1
+      tridiag_c1_arr(conta) = compo%c1
+      tridiag_an_arr(conta) = compo%an
+      tridiag_bn_arr(conta) = compo%bn
+      tridiag_cn_arr(conta) = compo%cn
+      Hyee_left_arr(conta) = compo%Hyee__left
+      Hyee_right_arr(conta) = compo%Hyee_right
+   end do
+
+end subroutine GetSGBCConstants
+
+!--------------------------------------------------------------------------------
+! Export per-node state (E, H, E_past, D) for GPU
+!--------------------------------------------------------------------------------
+subroutine GetSGBCState(E_arr, H_arr, E_past_arr, D_arr, offset)
+   real(kind=RKIND), intent(out) :: E_arr(:,:), H_arr(:,:), E_past_arr(:,:), D_arr(:,:)
+   integer(kind=4), intent(in) :: offset
+
+   type(SGBCSurface_t), pointer :: compo
+   integer(kind=4) :: conta, i
+
+   do conta = 1, malon%NumNodes
+      compo => malon%Nodes(conta)
+      do i = -compo%depth, compo%depth
+         E_arr(i + offset, conta) = compo%E(i)
+         E_past_arr(i + offset, conta) = compo%E_past(i)
+      end do
+      if (compo%depth > 0) then
+         do i = -compo%depth, compo%depth-1
+            H_arr(i + offset, conta) = compo%H(i)
+            D_arr(i + offset, conta) = compo%d(i)
+         end do
+      end if
+      ! Zero out D for depth=0 nodes
+      if (compo%depth == 0) then
+         D_arr(1 + offset, conta) = 0.0_RKIND
+      end if
+   end do
+
+end subroutine GetSGBCState
+
+!--------------------------------------------------------------------------------
+! Export per-timestep field pointers for GPU upload
+!--------------------------------------------------------------------------------
+subroutine GetSGBCFieldPointers(Efield_arr, Ha_Plus_arr, Ha_Minu_arr, &
+                                  Hb_Plus_arr, Hb_Minu_arr, numNodes_out)
+   real(kind=RKIND), intent(out) :: Efield_arr(:), Ha_Plus_arr(:), Ha_Minu_arr(:)
+   real(kind=RKIND), intent(out) :: Hb_Plus_arr(:), Hb_Minu_arr(:)
+   integer(kind=4), intent(out) :: numNodes_out
+
+   type(SGBCSurface_t), pointer :: compo
+   integer(kind=4) :: conta
+
+   numNodes_out = malon%NumNodes
+   do conta = 1, malon%NumNodes
+      compo => malon%Nodes(conta)
+      Efield_arr(conta) = compo%Efield
+      Ha_Plus_arr(conta) = compo%Ha_Plus
+      Ha_Minu_arr(conta) = compo%Ha_Minu
+      Hb_Plus_arr(conta) = compo%Hb_Plus
+      Hb_Minu_arr(conta) = compo%Hb_Minu
+   end do
+
+end subroutine GetSGBCFieldPointers
+
+
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
 ! subroutine to initialize the parameters
 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
diff --git a/src_main_pub/observation.F90 b/src_main_pub/observation.F90
index 5d8db95e2..22f23706d 100755
--- a/src_main_pub/observation.F90
+++ b/src_main_pub/observation.F90
@@ -108,8 +108,8 @@ module Observa_m
   real(kind=RKIND), pointer, dimension(:), save  :: InvEps, InvMu
   type(output_t), pointer, dimension(:), save  :: output
 
-  public InitObservation, FlushObservationFiles, UpdateObservation, DestroyObservation, CloseObservationFiles, unpacksinglefiles, &
-    GetOutput, preprocess_observation
+public InitObservation, FlushObservationFiles, UpdateObservation, DestroyObservation, CloseObservationFiles, unpacksinglefiles, &
+     GetOutput, preprocess_observation, UpdateProbeResultsFromGPU
   public output_t, item_t, Serialized_t, dtft
   !Required at least in tests
   public fieldo
@@ -3686,12 +3686,61 @@ subroutine updateJ(field, jdir)
         end select
       end subroutine
 
-    end subroutine UpdateObservation
+end subroutine UpdateObservation
+
+    !--------------------------------------------------------------------------------
+    ! Update probe results from GPU-sampled values
+    ! Called when GPU samples probes directly instead of using UpdateObservation
+    !--------------------------------------------------------------------------------
+    subroutine UpdateProbeResultsFromGPU(sgg, nTime, nInit, pointResults, blockResults, nPointProbes, nBlockProbes)
+       type(SGGFDTDINFO_t), intent(in) :: sgg
+       integer(kind=4), intent(in) :: nTime, nInit
+       real(kind=RKIND), dimension(:), intent(in) :: pointResults
+       real(kind=RKIND), dimension(:), intent(in) :: blockResults
+       integer(kind=4), intent(in) :: nPointProbes, nBlockProbes
+       integer(kind=4) :: ii, i, idx, nTimeOffset
+       integer(kind=4) :: pointObservationCases(6), blockObservationCases(6)
+
+       pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz]
+       blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz]
+       nTimeOffset = nTime - nInit
+
+       ! Update point probe results
+       if (nPointProbes > 0) then
+          idx = 0
+          do ii = 1, sgg%NumberRequest
+             if (.not. sgg%Observation(ii)%TimeDomain) cycle
+             do i = 1, sgg%Observation(ii)%nP
+                if (sgg%Observation(ii)%P(i)%what == nothing) cycle
+                if (any(sgg%Observation(ii)%P(i)%what == pointObservationCases)) then
+                   idx = idx + 1
+                   output(ii)%item(i)%valor(nTimeOffset) = pointResults(idx)
+                end if
+             end do
+          end do
+       end if
+
+       ! Update block probe results
+       if (nBlockProbes > 0) then
+          idx = 0
+          do ii = 1, sgg%NumberRequest
+             if (.not. sgg%Observation(ii)%TimeDomain) cycle
+             do i = 1, sgg%Observation(ii)%nP
+                if (sgg%Observation(ii)%P(i)%what == nothing) cycle
+                if (any(sgg%Observation(ii)%P(i)%what == blockObservationCases)) then
+                   idx = idx + 1
+                   output(ii)%item(i)%valor(nTimeOffset) = blockResults(idx)
+                end if
+             end do
+          end do
+       end if
 
-   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-   !!! Flushes the observed magnitudes to disk
-   !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
-   subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs, dxe,dye,dze,dxh,dyh,dzh,b,singlefilewrite,facesNF2FF,flushff)
+    end subroutine UpdateProbeResultsFromGPU
+
+    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+    !!! Flushes the observed magnitudes to disk
+    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+    subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs, dxe,dye,dze,dxh,dyh,dzh,b,singlefilewrite,facesNF2FF,flushff)
       use ilumina_m !is needed to also calculate the incident field in the observed points
       !solo lo precisa de entrada farfield
       type(bounds_t) :: b
@@ -3716,7 +3765,7 @@ subroutine FlushObservationFiles(sgg,nInit,FinalInstant,layoutnumber,num_procs,
       logical :: called_fromobservation, dummy_logical
       integer :: my_iostat
 
-      character(len=BUFSIZE) :: whoami
+       character(len=BUFSIZE) :: whoami
       !!!
       write (whoami, '(a,i5,a,i5,a)') '(', layoutnumber + 1, '/', num_procs, ') '
       called_fromobservation = .true.
diff --git a/src_main_pub/planewaves.F90 b/src_main_pub/planewaves.F90
index 55cf6a72e..43166b5b2 100755
--- a/src_main_pub/planewaves.F90
+++ b/src_main_pub/planewaves.F90
@@ -879,19 +879,19 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey,
 #ifdef CompileWithOpenMP
 !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m)
 #endif
-             do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez
-                k_m = k - b%Ez%ZI
-                do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez
-                   j_m = j - b%Ez%YI
-                   !--->
-                   incidente = Incid(sgg,jjj, iHy, timei, i-1, j, k,still_planewave_time,called_fromobservation)
-                   Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) - G2_1 * incidente * Id
-                end do
-             end do
+              do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez
+                 k_m = k - b%Ez%ZI
+                 do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez
+                    j_m = j - b%Ez%YI
+                    !--->
+                    incidente = Incid(sgg,jjj, iHy, timei, i-1, j, k,still_planewave_time,called_fromobservation)
+                    Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) - G2_1 * incidente * Id
+                 end do
+              end do
 #ifdef CompileWithOpenMP
 !$OMP END PARALLEL DO
 #endif
-             !Ey Back
+              !Ey Back
              i = TrFr(jjj)%I%tra%Ey  !Back
              i_m = i - b%Ey%XI
              Id = Idxh( i_m )
@@ -899,21 +899,21 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey,
 #ifdef CompileWithOpenMP
 !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m)
 #endif
-             do k = TrFr(jjj)%K%com%Ey, TrFr(jjj)%K%fin%Ey
-                k_m = k - b%Ey%ZI
-                do j = TrFr(jjj)%J%com%Ey, TrFr(jjj)%J%fin%Ey
-                   j_m = j - b%Ey%YI
-                   !--->
-                   incidente = Incid(sgg,jjj, iHz, timei, i-1, j, k,still_planewave_time,called_fromobservation)
-                   Ey( i_m, j_m, k_m) = Ey( i_m, j_m, k_m) + G2_1 * incidente * Id
-                End do
-             end do
+              do k = TrFr(jjj)%K%com%Ey, TrFr(jjj)%K%fin%Ey
+                 k_m = k - b%Ey%ZI
+                 do j = TrFr(jjj)%J%com%Ey, TrFr(jjj)%J%fin%Ey
+                    j_m = j - b%Ey%YI
+                    !--->
+                    incidente = Incid(sgg,jjj, iHz, timei, i-1, j, k,still_planewave_time,called_fromobservation)
+                    Ey( i_m, j_m, k_m) = Ey( i_m, j_m, k_m) + G2_1 * incidente * Id
+                 End do
+              end do
 #ifdef CompileWithOpenMP
 !$OMP END PARALLEL DO
 #endif
-          end if
-          !--->
-          If( IluminaFr(jjj)) then
+           end if
+           !--->
+           If( IluminaFr(jjj)) then
              !Ez  Front
              i = TrFr(jjj)%I%fro%Ez !Front
              i_m = i - b%Ez%XI
@@ -922,19 +922,19 @@ subroutine AdvancePlaneWaveE( sgg, timeinstant, b, g2, Idxh, Idyh, Idzh, Ex, Ey,
 #ifdef CompileWithOpenMP
 !$OMP PARALLEL do DEFAULT(SHARED) private (incidente,j,k,j_m,k_m)
 #endif
-             do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez
-                k_m = k - b%Ez%ZI
-                do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez
-                   j_m = j - b%Ez%YI
-                   !--->
-                   incidente = Incid(sgg,jjj, iHy, timei, i, j, k,still_planewave_time,called_fromobservation)
-                   Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) + G2_1 * incidente * Id
-                end do
-             end do
+              do k = TrFr(jjj)%K%com%Ez, TrFr(jjj)%K%fin%Ez
+                 k_m = k - b%Ez%ZI
+                 do j = TrFr(jjj)%J%com%Ez, TrFr(jjj)%J%fin%Ez
+                    j_m = j - b%Ez%YI
+                    !--->
+                    incidente = Incid(sgg,jjj, iHy, timei, i, j, k,still_planewave_time,called_fromobservation)
+                    Ez( i_m, j_m, k_m) = Ez( i_m, j_m, k_m) + G2_1 * incidente * Id
+                 end do
+              end do
 #ifdef CompileWithOpenMP
 !$OMP END PARALLEL DO
 #endif
-             !Ey  Front
+              !Ey  Front
              i = TrFr(jjj)%I%fro%Ey !Front
              i_m = i - b%Ey%XI
              Id = Idxh( i_m )
diff --git a/src_main_pub/timestepping.F90 b/src_main_pub/timestepping.F90
index 546ca5ce9..a9cbd6107 100755
--- a/src_main_pub/timestepping.F90
+++ b/src_main_pub/timestepping.F90
@@ -45,6 +45,17 @@ module Solver_m
    use sgbc_stoch
 #else
    use SGBC_nostoch_m
+#endif
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+   ! SGBC GPU not compatible with NVHPC 26.3
+   ! use gpu_sgbc_core_m
+   ! use gpu_sgbc_e_m
+   ! use gpu_sgbc_h_m
+   use gpu_core_probe_m
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+   use farfield_m
+   use gpu_nf2ff_m
+#endif
 #endif  
    use EDispersives_m
    use Mdispersives_m
@@ -79,6 +90,12 @@ module Solver_m
 !!
 #ifdef CompileWithProfiling
    use nvtx
+#endif
+#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+   use gpu_core_m
+   use gpu_yee_m
+   use gpu_cpml_m
+   use gpu_mur_m
 #endif
    implicit none
 
@@ -114,6 +131,11 @@ module Solver_m
       type(mtln_t) :: mtln_parsed
 #endif
 
+#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+      type(gpu_state_t) :: gpu
+      logical :: gpu_initialized = .false.
+#endif
+
    contains
       procedure :: init => solver_init
       procedure :: run => solver_run
@@ -465,6 +487,22 @@ subroutine solver_init(this)
 
       call this%init_fields()
       Ex => this%Ex; Ey => this%Ey; Ez => this%Ez; Hx => this%Hx; Hy => this%Hy; Hz => this%Hz
+
+#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN) || defined(CompileWithACC)
+       if (.not.this%gpu_initialized) then
+          call gpu_init(this%gpu, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz, &
+                        this%media%sggMiEx, this%media%sggMiEy, this%media%sggMiEz, &
+                        this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, &
+                        this%g%g1, this%g%g2, this%g%gm1, this%g%gm2, &
+                        this%Idxe, this%Idye, this%Idze, this%Idxh, this%Idyh, this%Idzh, &
+                        this%dxe, this%dye, this%dze, this%dxh, this%dyh, this%dzh)
+          this%gpu_initialized = this%gpu%initialized
+          ! Initialize probe buffers for on-device sampling
+          if (this%gpu_initialized .and. this%thereAre%Observation) then
+             call gpu_init_probe_buffers(this%gpu, this%sgg)
+          end if
+       endif
+#endif
       
       !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
       !!! Init the local variables and observation stuff needed by each module, taking into account resume status
@@ -588,6 +626,12 @@ subroutine solver_init(this)
 
       call fillMtag(this%sgg, this%media%sggMiEx, this%media%sggMiEy, this%media%sggMiEz, this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz,this%media%sggMtag, this%bounds, this%tag_numbers)
       call initializeObservation()
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+        ! Initialize GPU NF2FF buffers after InitFarField (deferred — full implementation pending)
+        ! if (this%gpu_initialized .and. associated(FF%ExIz)) then
+        !    call gpu_init_nf2ff_buffers(...)
+        ! endif
+#endif
 
       !!!!voy a jugar con fuego !!!210815 sincronizo las matrices de medios porque a veces se precisan. Reutilizo rutinas viejas mias NO CRAY. Solo se usan aqui
       !MPI initialization
@@ -1028,13 +1072,25 @@ subroutine reportSimulationOptions()
          end if
       end subroutine
 
-      subroutine initializeBorders()
-         character(len=BUFSIZE) :: dubuf
-         logical :: l_auxinput, l_auxoutput
-#ifdef CompileWithMPI
-         integer(kind=4) :: ierr
-#endif
-         write(dubuf,*) 'Init Other Borders...';  call print11(this%control%layoutnumber,dubuf)
+ subroutine initializeBorders()
+           character(len=BUFSIZE) :: dubuf
+           logical :: l_auxinput, l_auxoutput
+  #ifdef CompileWithMPI
+           integer(kind=4) :: ierr
+  #endif
+           integer(kind=4) :: left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj
+           integer(kind=4) :: left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj
+           integer(kind=4) :: right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj
+           integer(kind=4) :: right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj
+           integer(kind=4) :: down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj
+           integer(kind=4) :: down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj
+           integer(kind=4) :: up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj
+           integer(kind=4) :: up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj
+           integer(kind=4) :: back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj
+           integer(kind=4) :: back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj
+           integer(kind=4) :: front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj
+           integer(kind=4) :: front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj
+          write(dubuf,*) 'Init Other Borders...';  call print11(this%control%layoutnumber,dubuf)
          call InitOtherBorders    (this%sgg,this%thereAre)
          l_auxinput=this%thereAre%PECBorders.or.this%thereAre%PMCBorders.or.this%thereAre%PeriodicBorders
          l_auxoutput=l_auxinput
@@ -1052,10 +1108,62 @@ subroutine initializeBorders()
          call MPI_Barrier(SUBCOMM_MPI,ierr)
 #endif
          write(dubuf,*) 'Init CPML Borders...';  call print11(this%control%layoutnumber,dubuf)
-         call InitCPMLBorders     (this%sgg,this%sinPML_fullsize,this%thereAre%PMLBorders,this%control, &
-                                 dxe,dye,dze,dxh,dyh,dzh,Idxe,Idye,Idze,Idxh,Idyh,Idzh,this%eps0,this%mu0)
-
-         l_auxinput=this%thereAre%PMLBorders
+          call InitCPMLBorders     (this%sgg,this%sinPML_fullsize,this%thereAre%PMLBorders,this%control, &
+                                  dxe,dye,dze,dxh,dyh,dzh,Idxe,Idye,Idze,Idxh,Idyh,Idzh,this%eps0,this%mu0)
+
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+            if (this%gpu_initialized .and. this%thereAre%PMLBorders) then
+               write(dubuf,*) 'Init CPML GPU left boundary...';  call print11(this%control%layoutnumber,dubuf)
+               call gpu_init_pml_left(this%gpu, P_be_y, P_ce_y, P_bm_y, P_cm_y, &
+                                      PMLc(iEx)%XI(left), PMLc(iEx)%XE(left), PMLc(iEx)%YI(left), PMLc(iEx)%YE(left), PMLc(iEx)%ZI(left), PMLc(iEx)%ZE(left), &
+                                      PMLc(iEz)%XI(left), PMLc(iEz)%XE(left), PMLc(iEz)%YI(left), PMLc(iEz)%YE(left), PMLc(iEz)%ZI(left), PMLc(iEz)%ZE(left), &
+                                      PMLc(iHx)%XI(left), PMLc(iHx)%XE(left), PMLc(iHx)%YI(left), PMLc(iHx)%YE(left), PMLc(iHx)%ZI(left), PMLc(iHx)%ZE(left), &
+                                      PMLc(iHz)%XI(left), PMLc(iHz)%XE(left), PMLc(iHz)%YI(left), PMLc(iHz)%YE(left), PMLc(iHz)%ZI(left), PMLc(iHz)%ZE(left), &
+                                      this%gpu%Ex_ny, this%gpu%Ex_nz, this%gpu%Ez_ny, this%gpu%Ez_nz, this%gpu%Hx_ny, this%gpu%Hx_nz, this%gpu%Hz_ny, this%gpu%Hz_nz)
+               if (this%gpu%pml_left_initialized) then
+                  write(dubuf,*) 'Init CPML GPU right boundary...';  call print11(this%control%layoutnumber,dubuf)
+                  call gpu_init_pml_right(this%gpu, &
+                                         PMLc(iEx)%XI(right), PMLc(iEx)%XE(right), PMLc(iEx)%YI(right), PMLc(iEx)%YE(right), PMLc(iEx)%ZI(right), PMLc(iEx)%ZE(right), &
+                                         PMLc(iEz)%XI(right), PMLc(iEz)%XE(right), PMLc(iEz)%YI(right), PMLc(iEz)%YE(right), PMLc(iEz)%ZI(right), PMLc(iEz)%ZE(right), &
+                                         PMLc(iHx)%XI(right), PMLc(iHx)%XE(right), PMLc(iHx)%YI(right), PMLc(iHx)%YE(right), PMLc(iHx)%ZI(right), PMLc(iHx)%ZE(right), &
+                                         PMLc(iHz)%XI(right), PMLc(iHz)%XE(right), PMLc(iHz)%YI(right), PMLc(iHz)%YE(right), PMLc(iHz)%ZI(right), PMLc(iHz)%ZE(right))
+               endif
+               if (this%gpu%pml_right_initialized) then
+                  write(dubuf,*) 'Init CPML GPU down boundary...';  call print11(this%control%layoutnumber,dubuf)
+                  call gpu_init_pml_down(this%gpu, P_be_z, P_ce_z, P_bm_z, P_cm_z, &
+                                         PMLc(iEx)%XI(down), PMLc(iEx)%XE(down), PMLc(iEx)%YI(down), PMLc(iEx)%YE(down), PMLc(iEx)%ZI(down), PMLc(iEx)%ZE(down), &
+                                         PMLc(iEy)%XI(down), PMLc(iEy)%XE(down), PMLc(iEy)%YI(down), PMLc(iEy)%YE(down), PMLc(iEy)%ZI(down), PMLc(iEy)%ZE(down), &
+                                         PMLc(iHx)%XI(down), PMLc(iHx)%XE(down), PMLc(iHx)%YI(down), PMLc(iHx)%YE(down), PMLc(iHx)%ZI(down), PMLc(iHx)%ZE(down), &
+                                         PMLc(iHy)%XI(down), PMLc(iHy)%XE(down), PMLc(iHy)%YI(down), PMLc(iHy)%YE(down), PMLc(iHy)%ZI(down), PMLc(iHy)%ZE(down))
+                  if (this%gpu%pml_down_initialized) then
+                     write(dubuf,*) 'Init CPML GPU up boundary...';  call print11(this%control%layoutnumber,dubuf)
+                     call gpu_init_pml_up(this%gpu, &
+                                          PMLc(iEx)%XI(up), PMLc(iEx)%XE(up), PMLc(iEx)%YI(up), PMLc(iEx)%YE(up), PMLc(iEx)%ZI(up), PMLc(iEx)%ZE(up), &
+                                          PMLc(iEy)%XI(up), PMLc(iEy)%XE(up), PMLc(iEy)%YI(up), PMLc(iEy)%YE(up), PMLc(iEy)%ZI(up), PMLc(iEy)%ZE(up), &
+                                          PMLc(iHx)%XI(up), PMLc(iHx)%XE(up), PMLc(iHx)%YI(up), PMLc(iHx)%YE(up), PMLc(iHx)%ZI(up), PMLc(iHx)%ZE(up), &
+                                          PMLc(iHy)%XI(up), PMLc(iHy)%XE(up), PMLc(iHy)%YI(up), PMLc(iHy)%YE(up), PMLc(iHy)%ZI(up), PMLc(iHy)%ZE(up))
+                  endif
+                  if (this%gpu%pml_up_initialized) then
+                     write(dubuf,*) 'Init CPML GPU back boundary...';  call print11(this%control%layoutnumber,dubuf)
+                     call gpu_init_pml_back(this%gpu, P_be_x, P_ce_x, P_bm_x, P_cm_x, &
+                                            PMLc(iEz)%XI(back), PMLc(iEz)%XE(back), PMLc(iEz)%YI(back), PMLc(iEz)%YE(back), PMLc(iEz)%ZI(back), PMLc(iEz)%ZE(back), &
+                                            PMLc(iEy)%XI(back), PMLc(iEy)%XE(back), PMLc(iEy)%YI(back), PMLc(iEy)%YE(back), PMLc(iEy)%ZI(back), PMLc(iEy)%ZE(back), &
+                                            PMLc(iHz)%XI(back), PMLc(iHz)%XE(back), PMLc(iHz)%YI(back), PMLc(iHz)%YE(back), PMLc(iHz)%ZI(back), PMLc(iHz)%ZE(back), &
+                                            PMLc(iHy)%XI(back), PMLc(iHy)%XE(back), PMLc(iHy)%YI(back), PMLc(iHy)%YE(back), PMLc(iHy)%ZI(back), PMLc(iHy)%ZE(back))
+                     if (this%gpu%pml_back_initialized) then
+                        write(dubuf,*) 'Init CPML GPU front boundary...';  call print11(this%control%layoutnumber,dubuf)
+                        call gpu_init_pml_front(this%gpu, &
+                                               PMLc(iEz)%XI(front), PMLc(iEz)%XE(front), PMLc(iEz)%YI(front), PMLc(iEz)%YE(front), PMLc(iEz)%ZI(front), PMLc(iEz)%ZE(front), &
+                                               PMLc(iEy)%XI(front), PMLc(iEy)%XE(front), PMLc(iEy)%YI(front), PMLc(iEy)%YE(front), PMLc(iEy)%ZI(front), PMLc(iEy)%ZE(front), &
+                                               PMLc(iHz)%XI(front), PMLc(iHz)%XE(front), PMLc(iHz)%YI(front), PMLc(iHz)%YE(front), PMLc(iHz)%ZI(front), PMLc(iHz)%ZE(front), &
+                                               PMLc(iHy)%XI(front), PMLc(iHy)%XE(front), PMLc(iHy)%YI(front), PMLc(iHy)%YE(front), PMLc(iHy)%ZI(front), PMLc(iHy)%ZE(front))
+                     endif
+                  endif
+               endif
+            endif
+#endif
+
+          l_auxinput=this%thereAre%PMLBorders
          l_auxoutput=l_auxinput
 #ifdef CompileWithMPI
          call MPI_Barrier(SUBCOMM_MPI,ierr)
@@ -1086,9 +1194,104 @@ subroutine initializeBorders()
 #ifdef CompileWithMPI
          call MPI_Barrier(SUBCOMM_MPI,ierr)
 #endif
-         write(dubuf,*) 'Init Mur Borders...';  call print11(this%control%layoutnumber,dubuf)
-         call InitMURBorders      (this%sgg,this%thereAre%MURBorders,this%control%resume,Idxh,Idyh,Idzh,this%eps0,this%mu0)
-         l_auxinput= this%thereAre%MURBorders
+     write(dubuf,*) 'Init Mur Borders...';  call print11(this%control%layoutnumber,dubuf)
+             call InitMURBorders      (this%sgg,this%thereAre%MURBorders,this%control%resume,Idxh,Idyh,Idzh,this%eps0,this%mu0)
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+              if (this%gpu_initialized) then
+                 call gpu_init_mur_coeffs(this%gpu, this%sgg%numMedia, &
+                                          left_CAB1, left_CAB3, left_cab4, &
+                                          right_CAB1, right_CAB3, right_cab4, &
+                                          down_CAB1, down_CAB3, down_cab4, &
+                                          up_CAB1, up_CAB3, up_cab4, &
+                                          back_CAB1, back_CAB3, back_cab4, &
+                                          front_CAB1, front_CAB3, front_cab4)
+                ! Get MUR domain limits for GPU — each boundary uses 2 fields
+                  call get_mur_limits(4, 1, &
+                                       left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj)
+                  call get_mur_limits(5, 1, &
+                                      left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj)
+                 call get_mur_limits(4, 2, &
+                                      right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj)
+                 call get_mur_limits(5, 2, &
+                                      right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj)
+                 call get_mur_limits(6, 3, &
+                                      down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj)
+                 call get_mur_limits(4, 3, &
+                                      down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj)
+                 call get_mur_limits(6, 4, &
+                                      up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj)
+                 call get_mur_limits(4, 4, &
+                                      up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj)
+                 call get_mur_limits(5, 5, &
+                                      back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj)
+                 call get_mur_limits(6, 5, &
+                                      back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj)
+                 call get_mur_limits(5, 6, &
+                                      front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj)
+                 call get_mur_limits(6, 6, &
+                                      front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj)
+                 call gpu_init_mur_limits(this%gpu, &
+                                           left_Hx_ii, left_Hx_ij, left_Hx_ji, left_Hx_jj, left_Hx_ki, left_Hx_kj, &
+                                           left_Hz_ii, left_Hz_ij, left_Hz_ji, left_Hz_jj, left_Hz_ki, left_Hz_kj, &
+                                           right_Hx_ii, right_Hx_ij, right_Hx_ji, right_Hx_jj, right_Hx_ki, right_Hx_kj, &
+                                           right_Hz_ii, right_Hz_ij, right_Hz_ji, right_Hz_jj, right_Hz_ki, right_Hz_kj, &
+                                           down_Hy_ii, down_Hy_ij, down_Hy_ji, down_Hy_jj, down_Hy_ki, down_Hy_kj, &
+                                           down_Hx_ii, down_Hx_ij, down_Hx_ji, down_Hx_jj, down_Hx_ki, down_Hx_kj, &
+                                           up_Hy_ii, up_Hy_ij, up_Hy_ji, up_Hy_jj, up_Hy_ki, up_Hy_kj, &
+                                           up_Hx_ii, up_Hx_ij, up_Hx_ji, up_Hx_jj, up_Hx_ki, up_Hx_kj, &
+                                           back_Hz_ii, back_Hz_ij, back_Hz_ji, back_Hz_jj, back_Hz_ki, back_Hz_kj, &
+                                           back_Hy_ii, back_Hy_ij, back_Hy_ji, back_Hy_jj, back_Hy_ki, back_Hy_kj, &
+                                           front_Hz_ii, front_Hz_ij, front_Hz_ji, front_Hz_jj, front_Hz_ki, front_Hz_kj, &
+                                           front_Hy_ii, front_Hy_ij, front_Hy_ji, front_Hy_jj, front_Hy_ki, front_Hy_kj)
+                  write(dubuf,*) 'Init MUR GPU past fields...'
+                  call print11(this%control%layoutnumber,dubuf)
+                  call gpu_init_mur_past_fields(this%gpu, &
+                                                ubound(regLR(left)%Past_Hx,1)-lbound(regLR(left)%Past_Hx,1)+1, &
+                                                ubound(regLR(left)%Past_Hx,2)-lbound(regLR(left)%Past_Hx,2)+1, &
+                                                ubound(regLR(left)%Past_Hx,3)-lbound(regLR(left)%Past_Hx,3)+1, &
+                                                ubound(regLR(left)%Past_Hz,1)-lbound(regLR(left)%Past_Hz,1)+1, &
+                                                ubound(regLR(left)%Past_Hz,2)-lbound(regLR(left)%Past_Hz,2)+1, &
+                                                ubound(regLR(left)%Past_Hz,3)-lbound(regLR(left)%Past_Hz,3)+1, &
+                                                ubound(regLR(right)%Past_Hx,1)-lbound(regLR(right)%Past_Hx,1)+1, &
+                                                ubound(regLR(right)%Past_Hx,2)-lbound(regLR(right)%Past_Hx,2)+1, &
+                                                ubound(regLR(right)%Past_Hx,3)-lbound(regLR(right)%Past_Hx,3)+1, &
+                                                ubound(regLR(right)%Past_Hz,1)-lbound(regLR(right)%Past_Hz,1)+1, &
+                                                ubound(regLR(right)%Past_Hz,2)-lbound(regLR(right)%Past_Hz,2)+1, &
+                                                ubound(regLR(right)%Past_Hz,3)-lbound(regLR(right)%Past_Hz,3)+1, &
+                                                ubound(regDU(down)%Past_Hy,1)-lbound(regDU(down)%Past_Hy,1)+1, &
+                                                ubound(regDU(down)%Past_Hy,2)-lbound(regDU(down)%Past_Hy,2)+1, &
+                                                ubound(regDU(down)%Past_Hy,3)-lbound(regDU(down)%Past_Hy,3)+1, &
+                                                ubound(regDU(down)%Past_Hx,1)-lbound(regDU(down)%Past_Hx,1)+1, &
+                                                ubound(regDU(down)%Past_Hx,2)-lbound(regDU(down)%Past_Hx,2)+1, &
+                                                ubound(regDU(down)%Past_Hx,3)-lbound(regDU(down)%Past_Hx,3)+1, &
+                                                ubound(regDU(up)%Past_Hy,1)-lbound(regDU(up)%Past_Hy,1)+1, &
+                                                ubound(regDU(up)%Past_Hy,2)-lbound(regDU(up)%Past_Hy,2)+1, &
+                                                ubound(regDU(up)%Past_Hy,3)-lbound(regDU(up)%Past_Hy,3)+1, &
+                                                ubound(regDU(up)%Past_Hx,1)-lbound(regDU(up)%Past_Hx,1)+1, &
+                                                ubound(regDU(up)%Past_Hx,2)-lbound(regDU(up)%Past_Hx,2)+1, &
+                                                ubound(regDU(up)%Past_Hx,3)-lbound(regDU(up)%Past_Hx,3)+1, &
+                                                ubound(regBF(back)%Past_Hz,1)-lbound(regBF(back)%Past_Hz,1)+1, &
+                                                ubound(regBF(back)%Past_Hz,2)-lbound(regBF(back)%Past_Hz,2)+1, &
+                                                ubound(regBF(back)%Past_Hz,3)-lbound(regBF(back)%Past_Hz,3)+1, &
+                                                ubound(regBF(back)%Past_Hy,1)-lbound(regBF(back)%Past_Hy,1)+1, &
+                                                ubound(regBF(back)%Past_Hy,2)-lbound(regBF(back)%Past_Hy,2)+1, &
+                                                ubound(regBF(back)%Past_Hy,3)-lbound(regBF(back)%Past_Hy,3)+1, &
+                                                ubound(regBF(front)%Past_Hz,1)-lbound(regBF(front)%Past_Hz,1)+1, &
+                                                ubound(regBF(front)%Past_Hz,2)-lbound(regBF(front)%Past_Hz,2)+1, &
+                                                ubound(regBF(front)%Past_Hz,3)-lbound(regBF(front)%Past_Hz,3)+1, &
+                                                ubound(regBF(front)%Past_Hy,1)-lbound(regBF(front)%Past_Hy,1)+1, &
+                                                ubound(regBF(front)%Past_Hy,2)-lbound(regBF(front)%Past_Hy,2)+1, &
+                                                ubound(regBF(front)%Past_Hy,3)-lbound(regBF(front)%Past_Hy,3)+1, &
+                                                regLR(left)%Past_Hx, regLR(left)%Past_Hz, &
+                                                regLR(right)%Past_Hx, regLR(right)%Past_Hz, &
+                                                regDU(down)%Past_Hy, regDU(down)%Past_Hx, &
+                                                regDU(up)%Past_Hy, regDU(up)%Past_Hx, &
+                                                regBF(back)%Past_Hz, regBF(back)%Past_Hy, &
+                                                regBF(front)%Past_Hz, regBF(front)%Past_Hy)
+               endif
+#endif
+         ! CPU MUR path (AdvanceMagneticMUR) is used as fallback
+          l_auxinput= this%thereAre%MURBorders
          l_auxoutput=l_auxinput
 #ifdef CompileWithMPI
          call MPI_Barrier(SUBCOMM_MPI,ierr)
@@ -1729,10 +1932,12 @@ subroutine solver_run(this)
       real(kind=rkind), pointer, dimension(:) :: Idxe, Idye, Idze, Idxh, Idyh, Idzh, dxe, dye, dze, dxh, dyh, dzh
 
       logical :: call_timing, l_aux, flushFF, somethingdone, newsomethingdone
-      integer :: i
-      real(kind=rkind) :: pscale_alpha
-      real(kind=rkind_tiempo) :: at
-      character(len=bufsize) :: dubuf
+       integer :: i
+       real(kind=rkind) :: pscale_alpha
+       real(kind=rkind_tiempo) :: at
+       character(len=bufsize) :: dubuf
+       ! NF2FF GPU output buffers
+       real(kind=rkind), dimension(:), allocatable :: Etheta_out, Ephi_out, RCS_out
 #ifdef CompileWithMPI
       integer(kind=4) :: ierr
 #endif
@@ -1795,13 +2000,18 @@ subroutine solver_run(this)
                 call MPI_AllReduce( l_aux, this%perform%postprocess, 1_4, MPI_LOGICAL, MPI_LOR, SUBCOMM_MPI, ierr)
 #endif
 !!!!!!!!!!!!
-                if (this%perform%flushFIELDS) then
-                   write(dubuf,*)  SEPARADOR,trim(adjustl(this%control%nentradaroot)),separador
-                   call print11(this%control%layoutnumber,dubuf)
-                   write(dubuf,*)  'INIT FLUSHING OF RESTARTING FIELDS n=',this%n
-                   call print11(this%control%layoutnumber,dubuf)
-                   call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed,  &
-                   Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic)
+               if (this%perform%flushFIELDS) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+                    if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+                       call gpu_download(this%gpu)
+                    endif
+#endif
+                    write(dubuf,*)  SEPARADOR,trim(adjustl(this%control%nentradaroot)),separador
+                    call print11(this%control%layoutnumber,dubuf)
+                    write(dubuf,*)  'INIT FLUSHING OF RESTARTING FIELDS n=',this%n
+                    call print11(this%control%layoutnumber,dubuf)
+                    call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed,  &
+                    Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic)
 #ifdef CompileWithMPI
                    call MPI_Barrier(SUBCOMM_MPI,ierr)
 #endif
@@ -1824,8 +2034,18 @@ subroutine solver_run(this)
                       call print11(this%control%layoutnumber,dubuf)
                       call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
     !!
-                      if (this%thereAre%Observation) call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,flushFF)
-                      !!
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+                        if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+                           call gpu_download(this%gpu)
+                        endif
+                        ! GPU NF2FF flush — replaces CPU far-field pattern computation
+                        ! GPU NF2FF flush deferred — CPU path handles far-field computation
+                        ! if (this%gpu_initialized .and. this%gpu%nf2ff_initialized .and. this%thereAre%FarFields) then
+                        !    call gpu_flush_nf2ff(this%gpu, Etheta_out, Ephi_out, RCS_out)
+                        ! endif
+#endif
+                        if (this%thereAre%Observation) call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,flushFF)
+                       !!
 #ifdef CompileWithMPI
                       call MPI_Barrier(SUBCOMM_MPI,ierr)
 #endif
@@ -1865,12 +2085,17 @@ subroutine solver_run(this)
                       end if
                   !!       
                       if (this%perform%flushvtk) then   
-                         write(dubuf,'(a,i9)')  ' Post-processing .vtk files n= ',this%n
-                         call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
-                         call print11(this%control%layoutnumber,dubuf)
-                         call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
-                         somethingdone=.false.
-                         if (this%thereAre%Observation) call createvtkOnTheFly(this%control%layoutnumber,this%control%num_procs,this%sgg,this%control%vtkindex,somethingdone,this%control%mpidir,this%media%sggMtag,this%control%dontwritevtk)
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+                          if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+                             call gpu_download(this%gpu)
+                          endif
+#endif
+                          write(dubuf,'(a,i9)')  ' Post-processing .vtk files n= ',this%n
+                          call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
+                          call print11(this%control%layoutnumber,dubuf)
+                          call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
+                          somethingdone=.false.
+                          if (this%thereAre%Observation) call createvtkOnTheFly(this%control%layoutnumber,this%control%num_procs,this%sgg,this%control%vtkindex,somethingdone,this%control%mpidir,this%media%sggMtag,this%control%dontwritevtk)
 #ifdef CompileWithMPI
                          call MPI_Barrier(SUBCOMM_MPI,ierr)
                          call MPI_AllReduce( somethingdone, newsomethingdone, 1_4, MPI_LOGICAL, MPI_LOR, SUBCOMM_MPI, ierr)
@@ -1971,16 +2196,78 @@ subroutine solver_run(this)
       end do ciclo_temporal ! End of the time-stepping loop
 
 contains
-      subroutine updateAndFlush()
-         integer(kind=4) :: mindum
-         if (this%thereAre%Observation) then
-            call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds)
+subroutine updateAndFlush()
+            integer(kind=4) :: mindum, ii, i, idx
+            real(kind=rkind), allocatable, dimension(:) :: point_results, block_results
+            integer(kind=4) :: pointCount, blockCount
+            integer(kind=4) :: pointObservationCases(6), blockObservationCases(6)
+ #if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+            ! GPU optimization: keep fields on device between timesteps.
+            ! Sample probes directly on GPU — no field download needed.
+            if (this%gpu_initialized .and. this%gpu%fields_on_device .and. this%thereAre%Observation) then
+               pointObservationCases = [iEx, iEy, iEz, iHx, iHy, iHz]
+               blockObservationCases = [iBloqueJx, iBloqueJy, iBloqueJz, iBloqueMx, iBloqueMy, iBloqueMz]
+               pointCount = 0
+               blockCount = 0
+
+               ! Count probes by type
+               do ii = 1, this%sgg%NumberRequest
+                  if (.not. this%sgg%Observation(ii)%TimeDomain) cycle
+                  do i = 1, this%sgg%Observation(ii)%nP
+                     if (this%sgg%Observation(ii)%P(i)%what == nothing) cycle
+                     if (any(this%sgg%Observation(ii)%P(i)%what == pointObservationCases)) then
+                        pointCount = pointCount + 1
+                     else if (any(this%sgg%Observation(ii)%P(i)%what == blockObservationCases)) then
+                        blockCount = blockCount + 1
+                     end if
+                  end do
+               end do
+
+              ! Sample point probes on GPU
+                if (pointCount > 0) then
+                   allocate(point_results(pointCount))
+                   call gpu_sample_point_probes(this%gpu, point_results, this%n)
+                   call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, pointCount, 0)
+                   deallocate(point_results)
+                end if
+
+                ! Sample block probes on GPU
+                if (blockCount > 0) then
+                   allocate(block_results(blockCount))
+                   call gpu_sample_block_probes(this%gpu, block_results, this%n)
+                   call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, 0, blockCount)
+                   deallocate(block_results)
+                end if
+
+                ! Fused probe sampling (alternative — single kernel launch for both types)
+                ! if (pointCount > 0 .or. blockCount > 0) then
+                !    if (pointCount > 0) allocate(point_results(pointCount))
+                !    if (blockCount > 0) allocate(block_results(blockCount))
+                !    call gpu_sample_all_probes(this%gpu, point_results, block_results, this%n)
+                !    call UpdateProbeResultsFromGPU(this%sgg, this%n, this%ini_save, point_results, block_results, pointCount, blockCount)
+                !    if (pointCount > 0) deallocate(point_results)
+                !    if (blockCount > 0) deallocate(block_results)
+                ! end if
+
+            else
+               ! CPU path — download fields and call UpdateObservation
+               if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+                  call gpu_download(this%gpu)
+               endif
+               if (this%thereAre%Observation) then
+                  call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds)
+               end if
+            endif
+ #else
+            if (this%thereAre%Observation) then
+               call UpdateObservation(this%sgg,this%media,this%tag_numbers, this%n,this%ini_save, Ex, Ey, Ez, Hx, Hy, Hz, dxe, dye, dze, dxh, dyh, dzh,this%control%wiresflavor,this%sinPML_fullsize,this%control%wirecrank, this%control%noconformalmapvtk,this%bounds)
+            endif
+ #endif
             if (this%n>=this%ini_save+BuffObse)  then
                mindum=min(this%control%finaltimestep,this%ini_save+BuffObse)
                call FlushObservationFiles(this%sgg,this%ini_save,mindum,this%control%layoutnumber,this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.FALSE.) !no se flushean los farfields ahora
             end if
-         end if
-      end subroutine
+         end subroutine
 
       subroutine singleUnpack()
          character(len=BUFSIZE) :: dubuf
@@ -2066,8 +2353,6 @@ subroutine step(this)
       call this%advancePlaneWaveH()
       call this%advanceNodalH()
       call this%advanceWiresH()
-      call this%MinusCloneMagneticPMC()
-      call this%CloneMagneticPeriodic()
 
 #ifdef CompileWithMPI
       !!Flush all the MPI (esto estaba justo al principo del bucle temporal diciendo que era necesario para correcto resuming)
@@ -2158,29 +2443,34 @@ subroutine init_MPIConformalProbes(this)
    end subroutine init_MPIConformalProbes
 #endif
 
-   subroutine advanceE(this)
-      class(solver_t) :: this
-#ifdef CompileWithProfiling
-      call nvtxStartRange("Antes del bucle EX")
+subroutine advanceE(this)
+        class(solver_t) :: this
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+        if (this%gpu_initialized) then
+           this%gpu%gpu_e_fused_launched = .false.
+        endif
 #endif
-      call this%advanceEx(this%media%sggMiEx)
 #ifdef CompileWithProfiling
-      call nvtxEndRange
+        call nvtxStartRange("Antes del bucle EX")
+#endif
+       call this%advanceEx(this%media%sggMiEx)
+#ifdef CompileWithProfiling
+       call nvtxEndRange
 
-      call nvtxStartRange("Antes del bucle EY")
+       call nvtxStartRange("Antes del bucle EY")
 #endif
-      call this%advanceEy(this%media%sggMiEy)
-      
-#ifdef CompileWithProfiling    
-      call nvtxEndRange
+       call this%advanceEy(this%media%sggMiEy)
 
-      call nvtxStartRange("Antes del bucle EZ")
+#ifdef CompileWithProfiling
+       call nvtxEndRange
+
+       call nvtxStartRange("Antes del bucle EZ")
 #endif
-      call this%advanceEz(this%media%sggMiEz)
-#ifdef CompileWithProfiling    
-      call nvtxEndRange
+       call this%advanceEz(this%media%sggMiEz)
+#ifdef CompileWithProfiling
+        call nvtxEndRange
 #endif
-   end subroutine
+    end subroutine
 
    subroutine advanceEx(this, sggMiEx)
       class(solver_t) :: this
@@ -2196,6 +2486,16 @@ subroutine advanceEx(this, sggMiEx)
       integer(kind=4) :: i, j, k
       integer(kind=integersizeofmediamatrices) :: medio
 
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+          if (this%gpu_initialized) then
+             if (.not. this%gpu%gpu_e_fused_launched) then
+                call gpu_advanceE_all(this%gpu, this%bounds)
+                this%gpu%gpu_e_fused_launched = .true.
+             endif
+             return
+          endif
+#endif
+
       Ex(0:this%bounds%Ex%NX-1,0:this%bounds%Ex%NY-1,0:this%bounds%Ex%NZ-1) => this%Ex
       Hy(0:this%bounds%Hy%NX-1,0:this%bounds%Hy%NY-1,0:this%bounds%Hy%NZ-1) => this%Hy
       Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz
@@ -2206,10 +2506,7 @@ subroutine advanceEx(this, sggMiEx)
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzhk,Idyhj) 
 #endif
-#ifdef CompileWithACC   
-!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzhk,Idyhj)  copyin(Ex,sggMiEx,Hy,Hz,Idyh,Idzh,b,G1,G2) copyout(Ex) 
-#endif
-      Do k=1,this%bounds%sweepEx%NZ
+       Do k=1,this%bounds%sweepEx%NZ
          Do j=1,this%bounds%sweepEx%NY
             Do i=1,this%bounds%sweepEx%NX
                Idzhk=Idzh(k)
@@ -2239,6 +2536,16 @@ subroutine advanceEy(this,sggMiEy)
       integer(kind=4) :: i, j, k
       integer(kind=integersizeofmediamatrices) :: medio
 
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+       if (this%gpu_initialized) then
+          if (.not. this%gpu%gpu_e_fused_launched) then
+             call gpu_advanceE_all(this%gpu, this%bounds)
+             this%gpu%gpu_e_fused_launched = .true.
+          endif
+          return
+       endif
+#endif
+
       Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey
       Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz
       Hx(0:this%bounds%Hx%NX-1,0:this%bounds%Hx%NY-1,0:this%bounds%Hx%NZ-1) => this%Hx
@@ -2248,9 +2555,6 @@ subroutine advanceEy(this,sggMiEy)
 
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzhk)  
-#endif
-#ifdef CompileWithACC   
-!$ACC parallel loop  DEFAULT(present) collapse (2) private (i,j,k,medio,Idzhk)     copyin(Ey,sggMiEy,Hz,Hx,Idzh,Idxh,b,G1,G2) copyout(Ey) 
 #endif
       Do k=1,this%bounds%sweepEy%NZ
          Do j=1,this%bounds%sweepEy%NY
@@ -2284,6 +2588,16 @@ subroutine advanceEz(this,sggMiEz)
       integer(kind = 4) :: i, j, k
       integer(kind = INTEGERSIZEOFMEDIAMATRICES) :: medio
 
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+       if (this%gpu_initialized) then
+          if (.not. this%gpu%gpu_e_fused_launched) then
+             call gpu_advanceE_all(this%gpu, this%bounds)
+             this%gpu%gpu_e_fused_launched = .true.
+          endif
+          return
+       endif
+#endif
+
 
       Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez
       Hx(0:this%bounds%HX%NX-1,0:this%bounds%HX%NY-1,0:this%bounds%HX%NZ-1) => this%Hx
@@ -2295,10 +2609,7 @@ subroutine advanceEz(this,sggMiEz)
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do  DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idyhj)    
 #endif
-#ifdef CompileWithACC   
-!$ACC parallel loop   DEFAULT(present) collapse (2) private (i,j,k,medio,Idyhj)        copyin(Ez,sggMiEz,Hx,Hy,Idxh,Idyh,b,G1,G2) copyout(Ez) 
-#endif
-      Do k=1,this%bounds%sweepEz%NZ
+       Do k=1,this%bounds%sweepEz%NZ
          Do j=1,this%bounds%sweepEz%NY
             Do i=1,this%bounds%sweepEz%NX
                Idyhj=Idyh(j)
@@ -2347,6 +2658,16 @@ subroutine advanceHx(this, sggMiHx)
       integer(kind=4) :: i, j, k
       integer(kind=integersizeofmediamatrices) :: medio
 
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+       if (this%gpu_initialized) then
+          if (.not. this%gpu%gpu_h_fused_launched) then
+             call gpu_advanceH_all(this%gpu, this%bounds)
+             this%gpu%gpu_h_fused_launched = .true.
+          endif
+          return
+       endif
+#endif
+
       Hx(0:this%bounds%Hx%NX-1,0:this%bounds%Hx%NY-1,0:this%bounds%Hx%NZ-1) => this%Hx
       Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey
       Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez
@@ -2358,10 +2679,7 @@ subroutine advanceHx(this, sggMiHx)
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do  DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzek,Idyej)     
 #endif
-#ifdef CompileWithACC   
-!$ACC parallel loop  DEFAULT(present) collapse (2) private (i,j,k,medio,Idzek,Idyej)       copyin(Hx,sggMiHx,Ey,Ez,Idye,Idze,b,GM1,GM2) copyout(Hx) 
-#endif
-      Do k=1,this%bounds%sweepHx%NZ
+       Do k=1,this%bounds%sweepHx%NZ
          Do j=1,this%bounds%sweepHx%NY
             Do i=1,this%bounds%sweepHx%NX
             Idzek=Idze(k)
@@ -2390,6 +2708,16 @@ subroutine advanceHy(this, sggMiHy)
       integer(kind=4) :: i, j, k
       integer(kind=integersizeofmediamatrices) :: medio
 
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+       if (this%gpu_initialized) then
+          if (.not. this%gpu%gpu_h_fused_launched) then
+             call gpu_advanceH_all(this%gpu, this%bounds)
+             this%gpu%gpu_h_fused_launched = .true.
+          endif
+          return
+       endif
+#endif
+
       Hy(0:this%bounds%Hy%NX-1,0:this%bounds%Hy%NY-1,0:this%bounds%Hy%NZ-1) => this%Hy
       Ez(0:this%bounds%Ez%NX-1,0:this%bounds%Ez%NY-1,0:this%bounds%Ez%NZ-1) => this%Ez
       Ex(0:this%bounds%Ex%NX-1,0:this%bounds%Ex%NY-1,0:this%bounds%Ex%NZ-1) => this%Ex
@@ -2400,10 +2728,7 @@ subroutine advanceHy(this, sggMiHy)
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idzek)     
 #endif
-#ifdef CompileWithACC   
-!$ACC parallel loop DEFAULT(present) collapse (2) private (i,j,k,medio,Idzek)         copyin(Hy,sggMiHy,Ez,Ex,Idze,Idxe,b,GM1,GM2) copyout(Hy) 
-#endif
-      Do k=1,this%bounds%sweepHy%NZ
+       Do k=1,this%bounds%sweepHy%NZ
          Do j=1,this%bounds%sweepHy%NY
             Do i=1,this%bounds%sweepHy%NX
                Idzek=Idze(k)
@@ -2421,6 +2746,7 @@ end subroutine advanceHy
    subroutine advanceHz(this, sggMiHz)
       class(solver_t) :: this
       integer(kind=integersizeofmediamatrices), dimension(0:this%bounds%sggMiHz%NX-1,0:this%bounds%sggMiHz%NY-1,0:this%bounds%sggMiHz%NZ-1), intent(in) :: sggMiHz
+
       real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Hz
       real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Ex
       real(kind=rkind), dimension(:,:,:), pointer, contiguous :: Ey
@@ -2430,6 +2756,17 @@ subroutine advanceHz(this, sggMiHz)
       real(kind = RKIND) :: Idyej
       integer(kind = 4) :: i, j, k
       integer(kind = INTEGERSIZEOFMEDIAMATRICES) :: medio
+
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+       if (this%gpu_initialized) then
+          if (.not. this%gpu%gpu_h_fused_launched) then
+             call gpu_advanceH_all(this%gpu, this%bounds)
+             this%gpu%gpu_h_fused_launched = .true.
+          endif
+          return
+       endif
+#endif
+
       Hz(0:this%bounds%Hz%NX-1,0:this%bounds%Hz%NY-1,0:this%bounds%Hz%NZ-1) => this%Hz
       Ex(0:this%bounds%EX%NX-1,0:this%bounds%EX%NY-1,0:this%bounds%EX%NZ-1) => this%Ex
       Ey(0:this%bounds%Ey%NX-1,0:this%bounds%Ey%NY-1,0:this%bounds%Ey%NZ-1) => this%Ey
@@ -2439,10 +2776,7 @@ subroutine advanceHz(this, sggMiHz)
 #ifdef CompileWithOpenMP
 !$OMP  PARALLEL do DEFAULT(SHARED) collapse (2) private (i,j,k,medio,Idyej)  
 #endif
-#ifdef CompileWithACC   
-!$ACC parallel loop  DEFAULT(present) collapse (2) private (i,j,k,medio,Idyej)       copyin(Hz,sggMiHz,Ex,Ey,Idxe,Idye,b,GM1,GM2) copyout(Hz)
-#endif
-      Do k=1,this%bounds%sweepHz%NZ
+       Do k=1,this%bounds%sweepHz%NZ
          Do j=1,this%bounds%sweepHz%NY
             Do i=1,this%bounds%sweepHz%NX
                Idyej=Idye(j)
@@ -2452,13 +2786,15 @@ subroutine advanceHz(this, sggMiHz)
          End do
       End do
 #ifdef CompileWithOpenMP
-!$OMP  END PARALLEL DO
+ !$OMP  END PARALLEL DO
 #endif
-      return
-   end subroutine advanceHz
 
+       return
+     end subroutine advanceHz
 
-   subroutine solver_advanceEDispersiveE(this)
+
+
+    subroutine solver_advanceEDispersiveE(this)
       class(solver_t) :: this
       if (this%thereAre%Edispersives) call AdvanceEDispersiveE(this%sgg)
    end subroutine
@@ -2537,13 +2873,37 @@ subroutine solver_advancePMLbodyH(this)
       if (this%thereAre%PMLbodies) call AdvancePMLbodyH()
    end subroutine
 
-   subroutine solver_advanceMagneticCPML(this)
-      class(solver_t) :: this
-      If (this%thereAre%PMLBorders) call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & 
-                                                             this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & 
-                                                             this%g%gm2, this%Hx, this%Hy, this%Hz, & 
-                                                             this%Ex, this%Ey, this%Ez)
-   end subroutine
+  subroutine solver_advanceMagneticCPML(this)
+         class(solver_t) :: this
+         If (this%thereAre%PMLBorders) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+            if (this%gpu_initialized) then
+               this%gpu%gpu_h_fused_launched = .false.
+            endif
+            if (this%gpu_initialized .and. this%gpu%pml_left_initialized .and. this%gpu%pml_right_initialized .and. &
+                 this%gpu%pml_down_initialized .and. this%gpu%pml_up_initialized .and. &
+                 this%gpu%pml_back_initialized .and. this%gpu%pml_front_initialized) then
+                ! PML coefficients are constant - already set in gpu_init_pml_* at startup
+                call gpu_advanceCPML_H_left(this%gpu, this%bounds)
+               call gpu_advanceCPML_H_right(this%gpu, this%bounds, this%sgg%numMedia)
+               call gpu_advanceCPML_H_down(this%gpu, this%bounds)
+               call gpu_advanceCPML_H_up(this%gpu, this%bounds)
+               call gpu_advanceCPML_H_back(this%gpu, this%bounds)
+               call gpu_advanceCPML_H_front(this%gpu, this%bounds)
+            else
+               call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & 
+                                        this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & 
+                                        this%g%gm2, this%Hx, this%Hy, this%Hz, & 
+                                        this%Ex, this%Ey, this%Ez)
+            endif
+#else
+            call advanceMagneticCPML(this%sgg%numMedia, this%bounds, & 
+                                     this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, & 
+                                     this%g%gm2, this%Hx, this%Hy, this%Hz, & 
+                                     this%Ex, this%Ey, this%Ez)
+#endif
+         end if
+     end subroutine
 
    subroutine solver_MinusCloneMagneticPMC(this)
       class(solver_t) :: this
@@ -2559,30 +2919,49 @@ subroutine solver_CloneMagneticPeriodic(this)
 
 
    subroutine solver_advancePMLE(this)
-      class (solver_t) :: this
-      If (this%thereAre%PMLbodies) then !waveport absorbers
-         call AdvancePMLbodyE()
-      end if
-      If (this%thereAre%PMLBorders) then
-         call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & 
-                                  this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz)
-      end if
-   end subroutine
-
-   subroutine solver_advancesgbcE(this)
-      class(solver_t) :: this
-      if (this%thereAre%sgbcs.and.(this%control%sgbc)) then 
-         call AdvancesgbcE(real(this%sgg%dt,RKIND),this%control%sgbcDispersive, & 
-                                this%control%simu_devia,this%control%stochastic)
-      end if
-   end subroutine
-
-   subroutine solver_advancesgbcH(this)
-      class(solver_t) :: this
-      if (this%thereAre%sgbcs.and.(this%control%sgbc)) call AdvancesgbcH()
-   end subroutine
-
-   subroutine solver_advanceWiresE(this)
+         class (solver_t) :: this
+         If (this%thereAre%PMLbodies) then !waveport absorbers
+            call AdvancePMLbodyE()
+         end if
+         If (this%thereAre%PMLBorders) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+           if (this%gpu_initialized .and. this%gpu%pml_left_initialized .and. this%gpu%pml_right_initialized .and. &
+                 this%gpu%pml_down_initialized .and. this%gpu%pml_up_initialized .and. &
+                 this%gpu%pml_back_initialized .and. this%gpu%pml_front_initialized) then
+                ! PML coefficients are constant - already set in gpu_init_pml_* at startup
+                call gpu_advanceCPML_E_left(this%gpu, this%bounds)
+               call gpu_advanceCPML_E_right(this%gpu, this%bounds, this%sgg%numMedia)
+               call gpu_advanceCPML_E_down(this%gpu, this%bounds)
+               call gpu_advanceCPML_E_up(this%gpu, this%bounds)
+               call gpu_advanceCPML_E_back(this%gpu, this%bounds)
+               call gpu_advanceCPML_E_front(this%gpu, this%bounds)
+            else
+               call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & 
+                                        this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz)
+            endif
+#else
+            call AdvanceelectricCPML(this%sgg%numMedia, this%bounds,this%media%sggMiEx,this%media%sggMiEy,this%media%sggMiEz, & 
+                                     this%g%G2, this%Ex, this%Ey, this%Ez, this%Hx, this%Hy, this%Hz)
+#endif
+         end if
+      end subroutine
+
+  subroutine solver_advancesgbcE(this)
+        class(solver_t) :: this
+        if (this%thereAre%sgbcs.and.(this%control%sgbc)) then 
+           call AdvancesgbcE(real(this%sgg%dt,RKIND),this%control%sgbcDispersive, & 
+                                  this%control%simu_devia,this%control%stochastic)
+        end if
+     end subroutine
+
+ subroutine solver_advancesgbcH(this)
+        class(solver_t) :: this
+        if (this%thereAre%sgbcs.and.(this%control%sgbc)) then
+           call AdvancesgbcH()
+        end if
+   end subroutine
+
+    subroutine solver_advanceWiresE(this)
       class(solver_t) :: this
       character(len=bufsize) :: buff
 
@@ -2631,26 +3010,47 @@ subroutine solver_advancewiresH(this)
 
    end subroutine
 
-   subroutine solver_advanceMagneticMUR(this)
-      class(solver_t) :: this
-#ifdef CompileWithMPI
-      integer(kind=4) :: ierr
-#endif
-      If (this%thereAre%MURBorders) then
-         call AdvanceMagneticMUR(this%bounds, this%sgg, & 
-                                 this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, &
-                                 this%Hx, this%Hy, this%Hz, & 
-                                 this%control%mur_second)
-#ifdef CompileWithMPI
-         if (this%control%mur_second) then
-            if (this%control%num_procs>1) then
-               call MPI_Barrier(SUBCOMM_MPI,ierr)
-               call FlushMPI_H_Cray
-            end if
-         end if
-#endif
-      end if
-   end subroutine
+subroutine solver_advanceMagneticMUR(this)
+        class(solver_t) :: this
+#ifdef CompileWithMPI
+        integer(kind=4) :: ierr
+#endif
+#ifdef SEMBA_FDTD_ENABLE_CUDA_FORTRAN
+        if (this%gpu_initialized) then
+           this%gpu%gpu_h_fused_launched = .false.
+        endif
+#endif
+        If (this%thereAre%MURBorders) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+           if (this%gpu_initialized) then
+                call gpu_fused_mur_advance_hx(this%gpu, this%bounds)
+                call gpu_fused_mur_advance_hy(this%gpu, this%bounds)
+                call gpu_fused_mur_advance_hz(this%gpu, this%bounds)
+                call gpu_fused_mur_update_past_hx(this%gpu, this%bounds)
+                call gpu_fused_mur_update_past_hy(this%gpu, this%bounds)
+                call gpu_fused_mur_update_past_hz(this%gpu, this%bounds)
+            else
+               call AdvanceMagneticMUR(this%bounds, this%sgg, &
+                                      this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, &
+                                      this%Hx, this%Hy, this%Hz, & 
+                                      this%control%mur_second)
+            endif
+#else
+            call AdvanceMagneticMUR(this%bounds, this%sgg, &
+                                   this%media%sggMiHx, this%media%sggMiHy, this%media%sggMiHz, &
+                                   this%Hx, this%Hy, this%Hz, & 
+                                   this%control%mur_second)
+#endif
+  #ifdef CompileWithMPI
+           if (this%control%mur_second) then
+              if (this%control%num_procs>1) then
+                 call MPI_Barrier(SUBCOMM_MPI,ierr)
+                 call FlushMPI_H_Cray
+              end if
+           end if
+  #endif
+        end if
+     end subroutine
 
 
    subroutine solver_end(this)
@@ -2690,11 +3090,16 @@ subroutine solver_end(this)
       write(dubuf,*)'END FDTD time stepping. Beginning posprocessing at n= ',this%n
       call print11(this%control%layoutnumber,dubuf)
 
-      if ((this%control%flushsecondsFields/=0).or.this%perform%flushFIELDS) then
-         write(dubuf,'(a,i9)')  ' INIT FINAL FLUSHING OF RESTARTING FIELDS n= ',this%n
-         call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
-         call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed,  &
-         Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic)
+     if ((this%control%flushsecondsFields/=0).or.this%perform%flushFIELDS) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+          if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+             call gpu_download(this%gpu)
+          endif
+#endif
+          write(dubuf,'(a,i9)')  ' INIT FINAL FLUSHING OF RESTARTING FIELDS n= ',this%n
+          call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
+          call flush_and_save_resume(this%sgg, this%bounds, this%control%layoutnumber, this%control%num_procs, this%control%nentradaroot, this%control%nresumeable2, this%thereare, this%n,this%eps0,this%mu0, this%everflushed,  &
+          Ex, Ey, Ez, Hx, Hy, Hz,this%control%wiresflavor,this%control%simu_devia,this%control%stochastic)
          write(dubuf,'(a,i9)')  ' DONE FINAL FLUSHING OF RESTARTING FIELDS N=',this%n
          call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
          call print11(this%control%layoutnumber,dubuf)
@@ -2708,11 +3113,16 @@ subroutine solver_end(this)
       end if
       call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
       call print11(this%control%layoutnumber,dubuf)
-      call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
-      if (this%thereAre%Observation) then
-         call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.TRUE.)
-         call CloseObservationFiles(this%sgg,this%control%layoutnumber,this%control%num_procs,this%control%singlefilewrite,this%initialtimestep,this%lastexecutedtime,this%control%resume) !dump the remaining to disk
-      end if
+    call print11(this%control%layoutnumber,SEPARADOR//separador//separador)
+       if (this%thereAre%Observation) then
+#if defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+          if (this%gpu_initialized .and. this%gpu%fields_on_device) then
+             call gpu_download(this%gpu)
+          endif
+#endif
+          call FlushObservationFiles(this%sgg,this%ini_save, this%n,this%control%layoutnumber, this%control%num_procs, dxe, dye, dze, dxh, dyh, dzh,this%bounds,this%control%singlefilewrite,this%control%facesNF2FF,.TRUE.)
+          call CloseObservationFiles(this%sgg,this%control%layoutnumber,this%control%num_procs,this%control%singlefilewrite,this%initialtimestep,this%lastexecutedtime,this%control%resume) !dump the remaining to disk
+       end if
       
       if (this%thereAre%FarFields) then
          write(dubuf,'(a,i9)')   ' DONE FINAL OBSERVATION DATA FLUSHED and Near-to-Far field  n= ',this%n
@@ -2814,6 +3224,15 @@ subroutine solver_end(this)
       write(dubuf,*)'END FINAL POSTPROCESSING at n= ',this%n
       call print11(this%control%layoutnumber,dubuf)
       this%finishedwithsuccess=.true.
+
+#if defined(SEMBA_FDTD_ENABLE_ACC) || defined(SEMBA_FDTD_ENABLE_CUDA_FORTRAN)
+       if (this%gpu_initialized) then
+          call gpu_destroy_probe_buffers(this%gpu)
+          call gpu_destroy(this%gpu)
+          this%gpu_initialized = .false.
+       endif
+ #endif
+
       return
 
    end subroutine
diff --git a/src_mtln/mtl_bundle.F90 b/src_mtln/mtl_bundle.F90
index cce0acb60..6a6400ad0 100644
--- a/src_mtln/mtl_bundle.F90
+++ b/src_mtln/mtl_bundle.F90
@@ -12,6 +12,12 @@ module mtl_bundle_m
     use FDETYPES_m, only: RKIND, RKIND_TIEMPO
     implicit none
 
+    type :: external_field_segment_t
+        integer, dimension(3) :: position
+        integer :: direction = 0
+        real(kind=rkind), pointer :: field => null()
+    end type
+
     type, public :: mtl_bundle_t
         character(len=:), allocatable :: name
         real(kind=rkind), allocatable, dimension(:,:,:) :: lpul, cpul, rpul, gpul
@@ -66,12 +72,6 @@ module mtl_bundle_m
         module procedure mtldCtor
     end interface
 
-    type :: external_field_segment_t
-        integer, dimension(3) ::position
-        integer :: direction = 0
-        real(kind=rkind) , pointer  :: field => null()
-    end type
-
 contains
 
     function mtldCtor(levels, name) result(res)
diff --git a/testData/cases/nodalSource/nodalProfile.nsys-rep b/testData/cases/nodalSource/nodalProfile.nsys-rep
new file mode 100644
index 000000000..df1313ded
Binary files /dev/null and b/testData/cases/nodalSource/nodalProfile.nsys-rep differ
diff --git a/testData/cases/nodalSource/nodalProfile.sqlite b/testData/cases/nodalSource/nodalProfile.sqlite
new file mode 100644
index 000000000..a4f240266
Binary files /dev/null and b/testData/cases/nodalSource/nodalProfile.sqlite differ
diff --git a/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin b/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin
new file mode 100644
index 000000000..8820cb1bc
Binary files /dev/null and b/testData/cases/sphere/sphere.fdtd_electric_field_movie_ExC_2_2_2__77_77_77.h5bin differ
diff --git a/testData/cases/towelHanger/towelProfile.nsys-rep b/testData/cases/towelHanger/towelProfile.nsys-rep
new file mode 100644
index 000000000..7a8688f49
Binary files /dev/null and b/testData/cases/towelHanger/towelProfile.nsys-rep differ
diff --git a/testData/cases/towelHanger/towelProfile.sqlite b/testData/cases/towelHanger/towelProfile.sqlite
new file mode 100644
index 000000000..9d1ddbdf8
Binary files /dev/null and b/testData/cases/towelHanger/towelProfile.sqlite differ
diff --git a/testData/input_examples/airplane.fdtd1_tmpWarnings.txt b/testData/input_examples/airplane.fdtd1_tmpWarnings.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/testData/input_examples/airplane.fdtd_Report.txt b/testData/input_examples/airplane.fdtd_Report.txt
new file mode 100644
index 000000000..5ab9d57c3
--- /dev/null
+++ b/testData/input_examples/airplane.fdtd_Report.txt
@@ -0,0 +1,52 @@
+=========================
+semba-fdtd
+=========================
+__________________________________________
+Compilation date: May 11 2026 15:43:32
+Compiler Id: NVHPC 26.3.0
+git commit: 7cc0d31f
+cmake build type: Release
+cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -O3 -Minfo=accel -Mprefetch
+__________________________________________
+__________________________________________
+All rights reserved by the University of Granada (Spain)
+Contact person: Luis D. Angulo <lmdiazangulo@ugr.es>
+
+__________________________________________
+Compiled WITH SMBJSON support
+__________________________________________
+Launched on              11/05/2026 15:52
+INIT interpreting geometrical data from ./testData/input_examples/airplane.fdtd.json
+[OK] (    1/    1) Parser still working
+Switches ./build-cuda-rls/bin/semba-fdtd -i ./testData/input_examples/airplane.fdtd.json
+__________________________________________
+Closing warning file. Number of messages:             0
+__________________________________________
+Compiled with Single precision (real*4)
+__________________________________________
+Launched on              11/05/2026 15:52
+__________________________________________
+Launched with total options
+./build-cuda-rls/bin/semba-fdtd -i ./testData/input_examples/airplane.fdtd.json
+If later resuming use compulsory options
+mpirun -n    1
+__________________________________________
+INIT conversion internal ASCII => Binary
+__________________________________________
+__________________________________________
+__________________________________________
+Automatically correcting dt for stability reasons:
+Original dt:    1.2971875795741994E-009
+New dt:    3.0253469174290615E-010
+__________________________________________
+__________________________________________
+CFLN=    0.8000000
+__________________________________________
+__________________________________________
+Deltat=    3.0253469174290615E-010
+__________________________________________
+INIT NFDE --------> GEOM
+INIT UPDATING SHARED INFO. This process may take time!
+Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.)
+[OK] END UPDATING SHARED INFO
+(    1/    1) ERROR: gauss.exc DOES NOT EXIST
diff --git a/testData/input_examples/conformal.fdtd1_tmpWarnings.txt b/testData/input_examples/conformal.fdtd1_tmpWarnings.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/testData/input_examples/conformal.fdtd_Report.txt b/testData/input_examples/conformal.fdtd_Report.txt
new file mode 100644
index 000000000..3e6410bac
--- /dev/null
+++ b/testData/input_examples/conformal.fdtd_Report.txt
@@ -0,0 +1,47 @@
+=========================
+semba-fdtd
+=========================
+__________________________________________
+Compilation date: May 11 2026 15:50:40
+Compiler Id: NVHPC 25.9.0
+git commit: 7cc0d31f
+cmake build type: Release
+cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3
+__________________________________________
+__________________________________________
+All rights reserved by the University of Granada (Spain)
+Contact person: Luis D. Angulo <lmdiazangulo@ugr.es>
+
+__________________________________________
+Compiled WITH SMBJSON support
+__________________________________________
+Launched on              11/05/2026 15:52
+INIT interpreting geometrical data from ./testData/input_examples/conformal.fdtd.json
+[OK] (    1/    1) Parser still working
+Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/conformal.fdtd.json
+__________________________________________
+Closing warning file. Number of messages:             0
+__________________________________________
+Compiled with Single precision (real*4)
+__________________________________________
+Launched on              11/05/2026 15:52
+__________________________________________
+Launched with total options
+./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/conformal.fdtd.json
+If later resuming use compulsory options
+mpirun -n    1
+__________________________________________
+INIT conversion internal ASCII => Binary
+__________________________________________
+__________________________________________
+__________________________________________
+CFLN=    7.7888370E-02
+__________________________________________
+__________________________________________
+Deltat=    2.9999999880125916E-012
+__________________________________________
+INIT NFDE --------> GEOM
+INIT UPDATING SHARED INFO. This process may take time!
+Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.)
+[OK] END UPDATING SHARED INFO
+(    1/    1) ERROR: gauss.exc DOES NOT EXIST
diff --git a/testData/input_examples/sphere.fdtd1_tmpWarnings.txt b/testData/input_examples/sphere.fdtd1_tmpWarnings.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/testData/input_examples/sphere.fdtd_Report.txt b/testData/input_examples/sphere.fdtd_Report.txt
new file mode 100644
index 000000000..73ad25f29
--- /dev/null
+++ b/testData/input_examples/sphere.fdtd_Report.txt
@@ -0,0 +1,52 @@
+=========================
+semba-fdtd
+=========================
+__________________________________________
+Compilation date: May 11 2026 15:50:40
+Compiler Id: NVHPC 25.9.0
+git commit: 7cc0d31f
+cmake build type: Release
+cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3
+__________________________________________
+__________________________________________
+All rights reserved by the University of Granada (Spain)
+Contact person: Luis D. Angulo <lmdiazangulo@ugr.es>
+
+__________________________________________
+Compiled WITH SMBJSON support
+__________________________________________
+Launched on              11/05/2026 15:52
+INIT interpreting geometrical data from ./testData/input_examples/sphere.fdtd.json
+[OK] (    1/    1) Parser still working
+Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/sphere.fdtd.json
+__________________________________________
+Closing warning file. Number of messages:             0
+__________________________________________
+Compiled with Single precision (real*4)
+__________________________________________
+Launched on              11/05/2026 15:52
+__________________________________________
+Launched with total options
+./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/sphere.fdtd.json
+If later resuming use compulsory options
+mpirun -n    1
+__________________________________________
+INIT conversion internal ASCII => Binary
+__________________________________________
+__________________________________________
+__________________________________________
+Automatically correcting dt for stability reasons:
+Original dt:    3.8516700245905255E-011
+New dt:    3.8516662081988784E-011
+__________________________________________
+__________________________________________
+CFLN=    0.8000000
+__________________________________________
+__________________________________________
+Deltat=    3.8516662081988784E-011
+__________________________________________
+INIT NFDE --------> GEOM
+INIT UPDATING SHARED INFO. This process may take time!
+Launch with -noshared to skip this process (just relevant for structured NIBC CFCs and Anisot.)
+[OK] END UPDATING SHARED INFO
+(    1/    1) ERROR: gauss.exc DOES NOT EXIST
diff --git a/testData/input_examples/towelHanger.fdtd1_tmpWarnings.txt b/testData/input_examples/towelHanger.fdtd1_tmpWarnings.txt
new file mode 100644
index 000000000..e69de29bb
diff --git a/testData/input_examples/towelHanger.fdtd_Report.txt b/testData/input_examples/towelHanger.fdtd_Report.txt
new file mode 100644
index 000000000..352f120dd
--- /dev/null
+++ b/testData/input_examples/towelHanger.fdtd_Report.txt
@@ -0,0 +1,44 @@
+=========================
+semba-fdtd
+=========================
+__________________________________________
+Compilation date: May 11 2026 15:50:40
+Compiler Id: NVHPC 25.9.0
+git commit: 7cc0d31f
+cmake build type: Release
+cmake compilation flags: -Minfo=accel -Mpreprocess -Mbyteswapio -acc=gpu -gpu=ccnative,mem:separate -O3 -Minfo=accel -Mprefetch -acc=gpu -gpu=ccnative,mem:separate -O3
+__________________________________________
+__________________________________________
+All rights reserved by the University of Granada (Spain)
+Contact person: Luis D. Angulo <lmdiazangulo@ugr.es>
+
+__________________________________________
+Compiled WITH SMBJSON support
+__________________________________________
+Launched on              11/05/2026 15:52
+INIT interpreting geometrical data from ./testData/input_examples/towelHanger.fdtd.json
+[OK] (    1/    1) Parser still working
+Switches ./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/towelHanger.fdtd.json
+__________________________________________
+Closing warning file. Number of messages:             1
+__________________________________________
+Compiled with Single precision (real*4)
+__________________________________________
+Launched on              11/05/2026 15:52
+__________________________________________
+Launched with total options
+./build-nvhpc-rls/bin/semba-fdtd -i ./testData/input_examples/towelHanger.fdtd.json
+If later resuming use compulsory options
+mpirun -n    1
+__________________________________________
+INIT conversion internal ASCII => Binary
+__________________________________________
+__________________________________________
+__________________________________________
+CFLN=    5.1925581E-02
+__________________________________________
+__________________________________________
+Deltat=    9.9999999600419720E-013
+__________________________________________
+INIT NFDE --------> GEOM
+(    1/    1) ERROR: Current probe not found in WIRE segment         0
diff --git a/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt b/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt
new file mode 100644
index 000000000..bc2e52cb2
--- /dev/null
+++ b/testData/input_examples/towelHanger.fdtd_tmpWarnings.txt_Warnings.txt
@@ -0,0 +1 @@
+(    1/    1) ERROR: shieldedMultiwires and unshieldedMultiwires can only be defined if compiled with MTLN
diff --git a/towelHanger.exc b/towelHanger.exc
new file mode 120000
index 000000000..15b459d9b
--- /dev/null
+++ b/towelHanger.exc
@@ -0,0 +1 @@
+/home/luis/ugrfdtd/publico/tmp_cases/towelHanger/towelHanger.exc
\ No newline at end of file