COREFL performs direct numerical simulations of compressible reactive flows on GPUs based on the finite difference method.
The docs are under construction at https://corefl.readthedocs.io/en/latest/. The current README is a brief introduction to the code, covering the environment requirements, compilation, and running. A simple example case is also included to illustrate the usage of the code.
- CUDA compiler: supporting C++17 (nvcc > 11.0)
- C++ compiler: supporting C++20
- MPI library: supporting CUDA-aware MPI (e.g., OpenMPI > 1.8)
- CMake: supporting CUDA language
Comments on the environment:
- MPI: Only a few vendors' MPI implementations support CUDA-aware MPI, and only on Linux. Therefore, only Linux supports parallel runs of COREFL. However, any MPI implementation supports compilation and serial runs.
COREFL has been compiled and tested on the Nvidia A100 GPU. Our most frequently used configuration on the A100 is given for reference: CUDA 11.8 / gcc 11.3 / OpenMPI 4.1.5 / CMake 3.26.3
All compilations are performed on Linux. On Windows, we have successfully built the code with Visual Studio 2022, CUDA, and Microsoft MPI. You can also build it with CLion, but the toolchain must be MSVC instead of MinGW. We do not document that here because large-scale computations are always performed on Linux clusters.
The structure of the current COREFL folder is:
- docs/*
- example/*
- src/
- gxl_lib/*
- stat_lib/*
- *.cpp/.h/.cuh/.cu/.hpp
- tools/*
- CMakeLists.txt
The folder contains the COREFL source code, documentation, example cases, and some useful tools. Their concrete usage is discussed throughout the documentation.
The compilation consists of the following steps:
- Navigate to the COREFL folder (the "code" folder here).
- Modify the CMakeLists.txt:
  - Modify the number in set(CMAKE_CUDA_ARCHITECTURES 60) according to the GPU compute capability. For example, this number is 60 for P100, 70 for V100, and 80 for A100.
  - Modify add_compile_definitions(MAX_SPEC_NUMBER=9) according to the problem. The number should be larger than or equal to the number of species used in the computations. If no species is included, set it to 1.
  - Modify add_compile_definitions(MAX_REAC_NUMBER=19) according to the problem. The number should be larger than or equal to the number of reactions used in the computations. If no reaction is included, set it to 1.
  - Modify add_compile_definitions(Combustion2Part) according to the problem. If a general chemical reaction case is considered (combustion), set it to Combustion2Part. If a high-temperature air chemistry is to be simulated, set it to HighTempMultiPart. Only the 5-species, 6-reaction mechanism of air is currently supported when using HighTempMultiPart.
- Load the compilation environment. For example,
  module load mpi/openmpi4.1.5-gcc11.3.0-cuda11.8-ucx1.12.1 cmake/3.26.3
- Configure and build:
  cmake -Bbuild -DCMAKE_BUILD_TYPE=Release
  cmake --build build --parallel 16
The executable COREFL should appear in the current folder.
The two-temperature implementation is selected at compile time. Configure with COREFL_ENABLE_TWO_TEMPERATURE=ON:
cmake -S . -B build-2t -DCMAKE_BUILD_TYPE=Release -DCOREFL_ENABLE_TWO_TEMPERATURE=ON
cmake --build build-2t --parallel 16
cp corefl corefl-2t

On GPUs newer than the default CMake architecture, pass an architecture supported by the installed nvcc, for example -DCMAKE_CUDA_ARCHITECTURES=90 with CUDA 12 on this workstation; use 120 for Blackwell only with a CUDA compiler that supports compute_120. The top-level CMakeLists.txt currently defaults to MAX_SPEC_NUMBER=9, MAX_REAC_NUMBER=19, MAX_PASSIVE_SCALAR_NUMBER=1, NASA9-capable thermodynamics, and Combustion2Part. These defaults match the provided AIR5 two-temperature example. Increase MAX_SPEC_NUMBER or MAX_REAC_NUMBER before configuring if the mechanism is larger.
Runtime constraints for the two-temperature path are enforced in src/Parameter.cpp:
bool steady = 0
int temporal_scheme = 3
int species = 1
int reaction = 1
int therm_nasa_7_9 = 9
string two_temperature_file = chemistry/two_temperature.dat
The flamelet path, steady update, dual-time stepping, and Wu splitting are intentionally blocked for now because rhoEve chemistry/VT relaxation is not wired into those update paths. In the enabled RK3 CFD path, src/Parameter.cpp forces chemSrcMethod = 0, so the chemical rhoEve source is explicit.
The added conservative variable is rhoEve, stored as scalar Eve after the species mass fractions. Eve is an independent transported specific energy; it should not be written as a function of T or Tve. The output variable Tve is auxiliary and is recovered from Eve and the local composition through the mixture modal-energy map M_ve:
eps_ve,s(Theta) = eps_vib,s(Theta) + eps_el,s(Theta)
M_ve(Theta,Y) = sum_s Y_s * eps_ve,s(Theta)
Tve = inverse_M_ve(Eve; Y), i.e. M_ve(Tve,Y) = Eve
e_tr,s(T) = h_eq,s(T) - R_s * T - eps_ve,s(T)
E = 0.5 * (u^2 + v^2 + w^2) + sum_s Y_s * e_tr,s(T) + Eve
h_tr,s(T) = h_eq,s(T) - eps_ve,s(T)
F_Eve^inv = rho * Eve * U_k
S_Eve = theta_tr:ve + sum_s omega_s * eps_ve,s(Tve)
Eve remains an independent transported scalar in the PDE and characteristic derivation. Tve = inverse_M_ve(Eve; Y) is a thermodynamic recovery step used for source terms, diffusion, restart/boundary synchronization, and output. eps_ve,s(T) is only the modal-energy value subtracted from NASA equilibrium thermodynamics to define the translational-rotational part. The conserved Eve is not eps_ve,s(T) and is not eps_ve,s(Tve).
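The inversion M_ve(Tve, Y) = Eve can be sketched with a simple bisection, since M_ve is monotone in temperature. The sketch below uses an illustrative harmonic-oscillator vibrational model and made-up species data; it is not COREFL's actual thermodynamics or API:

```python
import math

def eps_ve(theta_v, T, R_s):
    """Harmonic-oscillator vibrational energy per unit mass (illustrative model)."""
    if T <= 0.0:
        return 0.0
    return R_s * theta_v / (math.exp(theta_v / T) - 1.0)

def mixture_eve(T, Y, theta_v, R):
    """M_ve(T, Y) = sum_s Y_s * eps_ve,s(T)."""
    return sum(y * eps_ve(tv, T, r) for y, tv, r in zip(Y, theta_v, R))

def invert_tve(eve_target, Y, theta_v, R, lo=50.0, hi=50000.0, tol=1e-10):
    """Solve M_ve(Tve, Y) = Eve by bisection; M_ve is monotone in T."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mixture_eve(mid, Y, theta_v, R) < eve_target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Because M_ve is strictly increasing in temperature for fixed composition, the bracketed bisection always converges; COREFL's actual inversion in src/FieldOperation.* may use a different root-finding scheme.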
theta_tr:ve is advanced with an exact exponential Landau-Teller update after each RK stage. The LT relaxation loop is applied to species with theta_v > 0; for those participating species the energy difference is the full eps_ve,s = eps_vib,s + eps_el,s, evaluated as eps_ve,s(T) - eps_ve,s(Tve). Atomic electronic VE without a molecular theta_v still enters conserved Eve, chemistry, diffusion, and Tve inversion through M_ve, but it does not get a separate LT relaxation time from the current metadata. The update changes rhoEve and Tve, then recloses T and p from the unchanged total energy. The chemical part sum_s omega_s eps_ve,s(Tve) is assembled with the finite-rate species source and follows the enabled RK3 explicit source path.
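The exact exponential update for one relaxing species can be sketched as below; the equilibrium energy eve_eq = eps_ve,s(T) and the relaxation time tau are treated as frozen inputs here (an illustrative sketch, not the in-code kernel):

```python
import math

def lt_exact_update(eve_old, eve_eq, dt, tau):
    """Exact solution of d(eve)/dt = (eve_eq - eve) / tau over one step,
    with eve_eq and tau frozen: exponential relaxation toward equilibrium,
    unconditionally stable for any dt."""
    return eve_eq - (eve_eq - eve_old) * math.exp(-dt / tau)
```

For dt much smaller than tau this reduces to the forward-Euler increment (eve_eq - eve_old) * dt / tau; for dt much larger than tau it saturates at eve_eq instead of overshooting, which is the point of using the exact exponential form after each RK stage.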
The viscous sign convention follows the existing species equation implementation. The physical species diffusion flux is J_s, so the species equation contains div(-J_s). COREFL stores the species viscous flux as:
D_s^code = -J_s
This is visible in src/ViscousScheme.cu: the stored species flux is proportional to +rhoD_s grad(Y_s) in the simple Fick limit, and the viscous residual adds the divergence of the stored flux. The equilibrium thermal conductivity is split in src/FieldOperation.cuh before the viscous kernels are called:
kappa_tr = kappa_eq * cp_tr / cp_eq
kappa_ve = kappa_eq * cv_ve(Tve) / cp_eq
Here cp_eq is the equilibrium mixture specific heat returned by the existing transport-property path. In code this variable is named cp_mix_tr before the split; after subtracting the equilibrium VE heat capacity, cp_tr = cp_eq - cv_ve(T).
This avoids counting the vibrational-electronic heat conduction once inside kappa_eq grad(T) and again inside kappa_ve grad(Tve). With that convention, the two-temperature viscous fluxes implemented in the solver are:
F_Eve^vis = kappa_ve * grad(Tve)
+ sum_s D_s^code * eps_ve,s(Tve)
F_E^vis = u dot tau + kappa_tr * grad(T) + kappa_ve * grad(Tve)
+ sum_s D_s^code * (h_tr,s(T) + eps_ve,s(Tve))
The nonequilibrium diffusion enthalpy helper in src/Thermo.cuh computes h_eq,s(T) - eps_ve,s(T) + eps_ve,s(Tve), which is exactly h_tr,s(T) + eps_ve,s(Tve). The 2nd-order face kernels and the 8th-order collocated kernels in src/ViscousScheme.cu use this same expression for total energy. The 8th-order scalar kernel also explicitly writes the rhoEve viscous flux into fFlux/gFlux/hFlux before the 8th-order derivative is taken, so the rhoEve derivative does not read an old or unset flux value.
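The conductivity split above can be sketched as follows (variable names illustrative, not COREFL's actual symbols):

```python
def split_conductivity(kappa_eq, cp_eq, cv_ve_T, cv_ve_Tve):
    """Split the equilibrium conductivity: the TR part carries the
    cp_tr = cp_eq - cv_ve(T) share, the VE part the cv_ve(Tve) share."""
    cp_tr = cp_eq - cv_ve_T
    kappa_tr = kappa_eq * cp_tr / cp_eq
    kappa_ve = kappa_eq * cv_ve_Tve / cp_eq
    return kappa_tr, kappa_ve
```

In the equilibrium limit Tve = T the two parts sum back to kappa_eq, so the vibrational-electronic conduction is counted exactly once.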
The wall heat-flux post-processing in src/PostProcess.cu uses the same split:
q_tr = kappa_tr * n dot grad(T)
+ sum_s D_s^code * h_tr,s(T)
q_ve = kappa_ve * n dot grad(Tve)
+ sum_s D_s^code * eps_ve,s(Tve)
q_total = q_tr + q_ve
A useful consistency check is not to add q_ve to a total-energy diffusion term that already used h_tr,s(T) + eps_ve,s(Tve) for species diffusion; doing so would count sum_s D_s^code * eps_ve,s(Tve) twice. src/PostProcess.cu keeps q_tr and q_ve separate first, then forms q_total.
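The wall-flux assembly can be sketched as below; D_code holds the stored species fluxes D_s^code = -J_s projected onto the wall normal (names illustrative):

```python
def wall_heat_flux(kappa_tr, dTdn, kappa_ve, dTvedn, D_code, h_tr, eps_ve_tve):
    """Assemble q_tr and q_ve separately, then sum them; the
    eps_ve,s(Tve) diffusion term appears only in q_ve, never twice."""
    q_tr = kappa_tr * dTdn + sum(d * h for d, h in zip(D_code, h_tr))
    q_ve = kappa_ve * dTvedn + sum(d * e for d, e in zip(D_code, eps_ve_tve))
    return q_tr, q_ve, q_tr + q_ve
```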
The main implementation files are:
- src/Define.h, CMakeLists.txt: compile-time kTwoTemperature switch.
- src/Parameter.cpp, src/DParameter.*, src/Field.*: variable indexing, Eve, Tve, and storage.
- src/ChemData.*, src/Thermo.*: NASA thermodynamics split into translational-rotational and vibrational-electronic parts, two_temperature.dat, and Landau-Teller data.
- src/FieldOperation.*: conservative/primitive conversion, Tve inversion, total-energy closure, and exact exponential VT update.
- src/FiniteRateChem.cu: sum_s omega_s eps_ve,s(Tve) source and two-temperature reaction-control temperatures.
- src/InviscidScheme.cu, src/WENO.cu: convective rhoEve flux and characteristic WENO projection.
- src/ViscousScheme.cu: Tve conduction, rhoEve viscous flux, and nonequilibrium enthalpy in total-energy diffusion.
- src/PostProcess.cu: wall q_tr, q_ve, and q_total split.
- src/RK.cuh: RK3-stage VT relaxation.
The detailed characteristic derivation and equation-to-code audit are in docs/two_temperature_weno_characteristic_derivation.pdf; the LaTeX source is docs/two_temperature_weno_characteristic_derivation.tex.
To restart the provided double-wedge checkpoint for a 100-step smoke test:
mkdir -p .codex_runs/doublewedge_2t_100
rsync -a example/doubleWedge2019/input .codex_runs/doublewedge_2t_100/
mkdir -p .codex_runs/doublewedge_2t_100/output/time_series .codex_runs/doublewedge_2t_100/output/message
cp example/doubleWedge2019/output_before_doubletemp_perf_20260510_053541/time_series/flowfield_2.0425e-04s.plt \
.codex_runs/doublewedge_2t_100/output/flowfield.plt
cp example/doubleWedge2019/output_before_doubletemp_perf_20260510_053541/message/step.txt \
.codex_runs/doublewedge_2t_100/output/message/step.txt

Set input/setup.txt in the run directory to initial = 1, parallel = 1, fixed_time_step = 1, dt = 5e-10, total_step = 100, and total_simulation_time later than 2.0425e-04 + 100 * 5e-10. Then run with the two-temperature executable using two MPI processes:
cd .codex_runs/doublewedge_2t_100
mpirun -np 2 /path/to/corefl

The restart file header should contain Eve and Tve; wall post-processing writes q_tr, q_ve, and q_total.
COREFL reads structured multiblock grid files in Plot3D format. We cannot partition the blocks automatically; therefore, if you want a parallel computation with multiple blocks, you need to partition the blocks manually.
The grid files read by COREFL are not the Plot3D files exported directly from grid-generation software, but the ones processed by our readGrid tool.
The readGrid tool will read the "gridFile.dat" and "gridFile.inp", where "gridFile" is the filename, and output two folders grid and boundary_condition, which should be moved to the input folder and read by COREFL.
The advantage of using such files is that each process only needs to read its own blocks instead of waiting for assignment. This may be improved or integrated into COREFL in the future.
This part is in the readGrid folder, and the compilation is straightforward and omitted here. The code can be compiled on Windows. A pre-compiled executable is also provided in the folder for anyone who uses this tool on Windows before uploading the grid to Linux clusters (which is what we do).
The grid for COREFL is a structured grid in Plot3D format. The Plot3D files, name.dat and name.inp, contain the grid coordinates and boundary condition information, respectively.
You can generate the grid with commercial software supporting Plot3D or by writing your own code. For example, Pointwise and Gridgen both support this file format. In Pointwise, choose the solver "Gridgen Generic" and export CAE to get those two files.
Note that COREFL has no automatic blocking. You need to partition the blocks manually when generating the grid.
With the two files, we need a tool called readGrid to convert the Plot3D files into two folders that COREFL reads. The readGrid tool is also included in this folder, and any C++ compiler supporting C++20 works.
The tool outputs two folders called grid and boundary_condition; these should be placed in the input folder of the working directory.
A typical folder for running COREFL is as follows:
- input/
- grid/*
- boundary_condition/*
- chemistry/*
- setup.txt
- run.sh
- corefl (optional)
All settings are set in the setup.txt file in the following manner:
type name = value
There should be a space between every two symbols.
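For illustration, a line of this form could be parsed as sketched below. This is a hypothetical reader written for this README; COREFL's actual parser may handle more cases (arrays, structs, comments) differently:

```python
def parse_setup_line(line):
    """Parse one 'type name = value' entry, e.g. 'real dt = 1e-9'.
    Tokens are whitespace-separated; '//' comments are stripped
    (illustrative behavior, check the actual COREFL reader)."""
    line = line.split("//")[0].strip()
    if not line:
        return None  # blank or comment-only line
    tokens = line.split()
    if len(tokens) < 4 or tokens[2] != "=":
        raise ValueError(f"malformed setup line: {line!r}")
    kind, name = tokens[0], tokens[1]
    cast = {"int": int, "real": float,
            "bool": lambda v: bool(int(v)), "string": str}[kind]
    return name, cast(" ".join(tokens[3:]))
```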
As we always run COREFL on clusters, we need a script file to run it. The executable is optional in this folder because we can specify the path to the executable in the script.
In our environment, we have 4 nodes with 4 Nvidia A100s per node. We can write the script as follows:
#!/bin/bash
#SBATCH --job-name=case1 # name of the job
#SBATCH --nodes=2 # number of nodes to use
#SBATCH --ntasks-per-node=4 # number of tasks per node
#SBATCH --gres=gpu:4 # number of GPUs per node
#SBATCH --qos=gpugpu # included when more than 1 node is used
module purge
module load mpi/openmpi4.1.5-gcc11.3.0-cuda11.8-ucx1.12.1
### Job ID
JOB_ID="${SLURM_JOB_ID}"
### hostfile
HOSTFILE="hostfile.${JOB_ID}"
GPUS=4
for i in `scontrol show hostnames`
do
let k=k+1
host[$k]=$i
echo "${host[$k]} slots=$GPUS" >> $HOSTFILE
done
# -n 8: total number of processes to be started
mpirun -n 8 \
--mca btl tcp,self \
--mca btl_tcp_if_include eth0 \
--mca pml ob1 \
--mca btl_base_warn_component_unused 0 \
--hostfile ${HOSTFILE} \
/path/to/corefl

With the above script, corefl will be started. An output folder will be created which contains all output files.
In the output folder, a file named flowfield.plt will be created, which is the instantaneous flowfield file. A folder called message will also be created. The flowfield file and the message folder are necessary for restarting a computation from existing results.
The example case is example/squareCylinder2D_M10_air5_dns. It contains input/setup_2t.txt, input/chemistry/air5_park.inp, and input/chemistry/two_temperature.dat.
After building a two-temperature executable as corefl-2t, run:
cd example/squareCylinder2D_M10_air5_dns
./run_2t.sh

If the executable has a different path, set COREFL_2T_BIN:
COREFL_2T_BIN=/path/to/corefl ./run_2t.sh

The script regenerates the mesh, copies input/setup_2t.txt to input/setup.txt, clears previous output, and starts the executable. Important setup entries are:
int therm_nasa_7_9 = 9
string mechanism_file = chemistry/air5_park.inp
string two_temperature_file = chemistry/two_temperature.dat
bool enable_vt_relaxation = 1
real Tve = 300.0
Boundary/profile input may provide either EVE directly or Tve; if Tve is provided, COREFL computes the corresponding mixture Eve from the local species composition.
For the Mach-10 AIR5 square-cylinder case, the one-temperature calculation can lock onto a collapsed near-wall branch during startup before a detached bow shock is established. A short wall-normal blowing stage can be enabled on the wall boundary to seed the detached shock without overriding the wall pressure.
The option is configured inside the wall boundary block in setup.txt:
struct square_wall {
string type = wall
int label = 2
string thermal_type = adiabatic
int catalytic_type = 0
int if_blow_shock_wave = 1
real blow_shock_wave_coefficient = 1.0
int blow_shock_wave_start_step = 3000
int blow_shock_wave_end_step = 6000
}
The parameters are:
- if_blow_shock_wave: enables the temporary wall-normal blowing when set to 1; the default is 0.
- blow_shock_wave_coefficient: multiplier for the local incoming normal velocity. The tested value for this case is 1.0; a smaller value such as 0.3 was not enough to keep the 1T solution away from the collapsed branch.
- blow_shock_wave_start_step and blow_shock_wave_end_step: inclusive RK-step window for the forcing. With dt = 5.0e-10, the 3000..6000 window corresponds to 1.5e-6..3.0e-6 s.
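The step-to-time mapping used for that window is simply (assuming a fixed dt, as in this case):

```python
def step_window_to_time(start_step, end_step, dt):
    """Convert an inclusive RK-step window to a physical-time window,
    assuming a fixed physical time step dt."""
    return start_step * dt, end_step * dt
```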
For a fresh 1T AIR5 run to 2.0e-5 s, the corresponding control settings used in the square-cylinder test were:
int total_step = 40000
real total_simulation_time = 2.0e-5
int output_time_series = 1000
On a workstation with multiple GPUs, select an idle GPU with CUDA_VISIBLE_DEVICES:
cd example/squareCylinder2D_M10_air5_dns
CUDA_VISIBLE_DEVICES=1 /path/to/corefl

A useful sanity check for this setup is the centerline bow-shock standoff distance at 2.0e-5 s. In the tested AIR5 square-cylinder case, the 1T blow-only run gave about 8.55e-4 m, while the 2T reference was about 9.19e-4 m; the collapsed 1T branch stayed near 2.33e-5 m. No wall pressure override is needed for this configuration.
The reactive shock tube is presented as an example case because the grid file is small and easy to upload. The folder is as follows:
- case/1-reactiveShockTube/
- 1-grid/
- generateMeshShockTube.py
- readGrid.exe
- 2-compute/
- input/
- grid/grid 0.dat
- boundary_condition/
- boundary 0.txt
- inner 0.txt
- parallel 0.txt
- chemistry/
- therm.dat
- tran.dat
- H2_mech-PREMIX.inp
- setup.txt
- run.sh
- input/
- 1-grid/
The first step is generating the grid for the computation. Although this is a 1D case, we need to generate a 2D grid because COREFL does not support 1D. As we want to use a high-order scheme, e.g., 7th-order WENO, we need at least 5 additional layers for the ghost-layer information; therefore, 7 layers are adopted in the y direction. In the x direction, we set a uniform dx. We provide a mesh-generation script in the subfolder 1-grid. Run the Python file, and the files will be generated in a folder called readGrid. Move the exe and input.txt to that folder, and modify input.txt as instructed by the comments. Run the exe, and you will get two folders called grid and boundary_condition. Upload them to the input folder of the computation folder, and you can use that grid.
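As a rough sketch of what such a mesh script computes (generateMeshShockTube.py is the authoritative script; nx, length, and the function name here are placeholders):

```python
def shock_tube_mesh(nx, length, ny_layers=7, dy=None):
    """Uniform x spacing over [0, length]; a few y-layers (7 here) make the
    quasi-1D grid tall enough for a high-order stencil. Illustrative sketch,
    not the actual generateMeshShockTube.py output format."""
    dx = length / (nx - 1)
    if dy is None:
        dy = dx  # square cells by default
    x = [i * dx for i in range(nx)]
    y = [j * dy for j in range(ny_layers)]
    return x, y
```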
We have already supplied a converted grid file in the 2-compute/input/ subfolder in case you just want to run the case.
For such a small case, we need only 1 GPU card. Modify the run file to the following form.
#!/bin/bash
#SBATCH --job-name=ReactiveShockTube # name of the job
#SBATCH --nodes=1 # number of nodes to use
#SBATCH --ntasks-per-node=1 # number of tasks per nodes
#SBATCH --gres=gpu:1 # number of GPUs per nodes
#SBATCH --qos=gpugpu # included when more than 1 node is used
module purge
module load mpi/openmpi4.1.5-gcc11.3.0-cuda11.8-ucx1.12.1
### Job ID
JOB_ID="${SLURM_JOB_ID}"
### hostfile
HOSTFILE="hostfile.${JOB_ID}"
GPUS=4
for i in `scontrol show hostnames`
do
let k=k+1
host[$k]=$i
echo "${host[$k]} slots=$GPUS" >> $HOSTFILE
done
mpirun -n 1 \
--mca btl tcp,self \
--mca btl_tcp_if_include eth0 \
--mca pml ob1 \
--mca btl_base_warn_component_unused 0 \
--hostfile ${HOSTFILE} \
/home/bingxing2/home/scx6d0j/GuoXL/code/corefl-cpc/corefl # change this path to your own corefl

With these files and a compiled executable corefl, the test can be started.
You may care about the settings, so let me briefly explain the ones related to the current simulation.
First, about the controls:
int gridIsBinary = 1 // the grid file is in ASCII (0) or binary (1).
real gridScale = 1 // In which scale is the grid generated. If the grid is generated in millimeters, the value is 0.001. In this case, we generate it in meters (1).
int total_step = 100000 // Total steps to compute
int output_file = 10000 // Frequency of outputting flowfield files
int output_screen = 1000 // Frequency of outputting residual info on screen
int output_time_series = 10000 // Frequency of outputting a flowfield named by the physical time, because we want to compare some transient info with existing profiles. If this value is 0, no time series will be output.

Next, about the temporal schemes:
bool steady = 0 // If the simulation is steady or not. Steady (1), transient (0)
real dt = 1e-9 // The physical time step in second.
real total_simulation_time = 2.4e-4 // We have data at 0.23 ms, so we want the computation to stop after that

Third, about the spatial schemes:
int shock_sensor = 2 // Ducros sensor (0), modified Jameson sensor (1), sensor based on density and pressure jump (2)
real shockSensor_threshold = -0.2 // A negative value means all points are computed by WENO scheme.
int viscous_order = 2 // inviscid(0), 2nd order (2), 8th order (8)
array int viscous_flux_tpb_2d {
16 16 1
}
array int viscous_derivative_tpb_2d {
32 16 1
}

The viscous-kernel CUDA block sizes can be overridden from setup.txt with viscous_flux_tpb_2d, viscous_flux_tpb_3d, viscous_derivative_tpb_2d, and viscous_derivative_tpb_3d. Each array must contain three positive integers and no more than 1024 total threads. The default 2D flux block is 16 16 1, which is the safe setting for the register-heavy two-temperature 8th-order scalar viscous flux kernel; the default 2D derivative block is 32 16 1.
Fourth, about the chemistry:
int species = 1 // Air (0), multi-component simulation (1)
string mechanism_file = chemistry/H2PREMIX.inp // The path is relative to the "input/" folder.
int reaction = 1 // No reaction (0), finite-rate chemistry based on the mechanism (1)

Fifth, about the boundary conditions:
array string boundary_conditions {
wall outflow // Write all boundary conditions' names here
}
struct wall {
string type = wall // Specify the type of this bc
int label = 2 // This label must be consistent with the label of the bc when generating grid
string thermal_type = adiabatic // Thermal wall type can be 1. "adiabatic" wall; 2. "isothermal" wall
real temperature = 300 // If the wall is isothermal, the temperature should be given. As the wall in this case is adiabatic, this value will not be used.
}
struct outflow {
string type = outflow
int label = 6
}
// other info about the flow in this case
string reference_state = left // Specify the reference state for the simulation.
string default_init = left // The default initialization info for the whole flowfield
struct left {
string type = inflow
int label = 5
int inflow_type = 0 // 0 for constant inflow, 1 for profile inflow
real density = 0.072
real velocity = 0
real pressure = 7173
real u = 1
real v = 0
real w = 0
real H2 = 0.012772428
real O2 = 0.101362139
real AR = 0.885865433
}
int groups_init = 2 // Because this case needs two parts with different conditions, we use COREFL's initialize-in-group function. The restriction is that groups other than "default_init" must be named "init_cond_l", where "l" is indexed from 0 to "groups_init - 1".
struct init_cond_0 {
real x0 = 0.06
real x1 = 0.15
real y0 = -1
real y1 = 1
real z0 = -2
real z1 = 2
string name = right // This tells the code to find a struct named as this value
}
struct right {
string type = inflow
int label = 5
int inflow_type = 0 // 0 for constant inflow, 1 for profile inflow
real density = 0.18075
real velocity = 487.34
real pressure = 35594
real u = -1
real v = 0
real w = 0
real H2 = 0.012772428
real O2 = 0.101362139
real AR = 0.885865433
}

If anyone is interested in using our code in more cases, feel free to email me at guoxinliang@buaa.edu.cn. I would be happy to cooperate with you on more applications.
Below are some casual remarks of mine.
The example here is a very simple case, but it illustrates many interesting functions, such as COREFL's group initialization. We currently did not include the readGrid code in this repository because the folder of the above case does not use it.
I have to admit that, as a fifth-year PhD student, a more detailed introduction would cost more energy than I can spare. If you want to use these, let us talk in private. Maybe I can give you some existing cases to save your time.
Besides, compilation and environment issues always occur. If there are troubles, please also contact me at that email. I really want my 2-3 years of effort to be used instead of protected on my own computer.
Anyway, we have already conducted much research based on our GPU code, which is much faster than CPU codes. It is really fascinating to run a DNS within several days instead of weeks or months.