REALM

Realistic AI Learning for Multiphysics: A Comprehensive Benchmark for Neural Surrogates on High-Fidelity Reactive Flows
Runze Mao¹,†, Rui Zhang¹,†, Zhi X. Chen¹,³,* + 23 co-authors
¹Peking University  |  ²Renmin University of China  |  ³AI for Science Institute  |  ⁴University of Calgary  |  ⁵Kyoto University  |  ⁶FM Global  |  ⁷LandSpace Technology  |  ⁸Aero Engine Academy of China
†Equal contribution  |  *Corresponding authors

About REALM

REALM (Realistic AI Learning for Multiphysics) is the first comprehensive benchmark dedicated to evaluating neural surrogates on realistic spatiotemporal multiphysics flows. Unlike existing benchmarks, which rely on simplified, low-dimensional problems, REALM tests models on challenging, application-driven reactive flow scenarios where traditional solvers struggle.

The benchmark features 11 high-fidelity datasets spanning canonical problems, high-Mach reactive flows, propulsion engines, and fire hazards. Each trajectory requires hundreds to thousands of CPU/GPU hours to generate, placing REALM squarely in the regime where surrogate acceleration is practically valuable.

We systematically evaluate more than 12 neural surrogate families, including spectral operators, convolutional models, Transformers, pointwise operators, and graph/mesh networks, and uncover three critical findings:

Key Findings

  • Scaling Barrier: Performance degrades rapidly with increasing dimensionality, stiffness, and mesh irregularity
  • Architecture Matters: Inductive biases dominate performance over raw parameter count
  • Accuracy Gap: High correlation metrics don't guarantee physically trustworthy behavior
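The correlation metric that anchors these findings can be made concrete. Below is a minimal sketch of one plausible definition, the mean per-snapshot Pearson correlation over an autoregressive rollout; REALM's exact formula is not given on this page, so treat this as illustrative:

```python
import numpy as np

def rollout_correlation(pred, true):
    """Mean per-snapshot Pearson correlation between a predicted
    rollout and the reference trajectory.

    pred, true: arrays of shape (T, ...) holding T field snapshots.
    Illustrative metric only; REALM's exact definition may differ.
    """
    corrs = []
    for p_t, q_t in zip(pred, true):
        p = p_t.ravel() - p_t.mean()   # center each snapshot
        q = q_t.ravel() - q_t.mean()
        denom = np.linalg.norm(p) * np.linalg.norm(q)
        corrs.append(float(p @ q / denom) if denom > 0 else 0.0)
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
x = rng.random((5, 32, 32))          # 5 snapshots of a 32x32 field
print(rollout_correlation(x, x))     # a perfect prediction gives 1.0
```

High values of this quantity only certify pointwise agreement, which is why (per the third finding) they do not by themselves guarantee physically trustworthy behavior.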
REALM Overview

Dataset Overview

IgnitHIT

Grid Size: 128×128  |  Variables: 12  |  Trajectories: 36  |  Timesteps: 30

Ignition kernels in homogeneous isotropic turbulence. H₂/O₂ premixed flame evolution with turbulent wrinkling and kernel merging.

EvolveJet

Grid Size: 256×256  |  Variables: 39  |  Trajectories: 30  |  Timesteps: 40

Time-evolving CH₄/O₂ shear jet flame with mixing-driven stabilization and strain-induced extinction/reignition.

PlanarDet

Grid Size: 832×384  |  Variables: 13  |  Trajectories: 9  |  Timesteps: 50

Planar cellular detonation with shock-reaction coupling and characteristic cellular structures.

ReactTGV

Grid Size: 256×256×256  |  Variables: 11  |  Trajectories: 16  |  Timesteps: 40

Reacting Taylor-Green vortex with flame-vortex interaction, extinction, and reignition dynamics.

PropHIT

Grid Size: 1025×129×129  |  Variables: 13  |  Trajectories: 7  |  Timesteps: 35

Propagating H₂-air flame in homogeneous isotropic turbulence with varying pressure and turbulence intensity.

PoolFire

Grid Size: 80×80×200  |  Variables: 9  |  Trajectories: 15  |  Timesteps: 21

Buoyancy-driven CH₄ pool fire with plume entrainment, puffing, and McCaffrey regime transitions.

SupCavityFlame

Cells: 2,988,480  |  Variables: 12  |  Trajectories: 9  |  Timesteps: 101

Supersonic H₂ cavity flame with shock-shear-flame interactions and recirculation stabilization.

ObstacleDet

Cells: 8,996,482  |  Variables: 5  |  Trajectories: 6  |  Timesteps: 51

Detonation diffraction around obstacle with potential decoupling and re-initiation.

SymmCoaxFlame

Cells: 294,900  |  Variables: 21  |  Trajectories: 12  |  Timesteps: 36

Single-element CH₄/O₂ rocket combustor with shear-coaxial injector and nozzle acceleration.

MultiCoaxFlame

Cells: 13,478,444  |  Variables: 22  |  Trajectories: 5  |  Timesteps: 91

Seven-element rocket combustor with multi-injector interactions and 3D turbulent mixing.

FacadeFire

Cells: 2,524,423  |  Variables: 9  |  Trajectories: 9  |  Timesteps: 101

Building facade fire with window venting, buoyant plume, and facade-guided flame attachment.
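For readers who want to script against the catalogue above, the eleven dataset cards can be collected into a small metadata table. The values are copied from the cards; the field names (`shape`, `cells`, `vars`, `traj`, `steps`) are informal shorthand, not an official REALM schema:

```python
import math

# Metadata for the 11 REALM datasets, copied from the dataset cards.
# Regular-grid cases carry a `shape`; irregular-mesh cases carry `cells`.
DATASETS = {
    "IgnitHIT":       dict(shape=(128, 128),      vars=12, traj=36, steps=30),
    "EvolveJet":      dict(shape=(256, 256),      vars=39, traj=30, steps=40),
    "PlanarDet":      dict(shape=(832, 384),      vars=13, traj=9,  steps=50),
    "ReactTGV":       dict(shape=(256, 256, 256), vars=11, traj=16, steps=40),
    "PropHIT":        dict(shape=(1025, 129, 129), vars=13, traj=7, steps=35),
    "PoolFire":       dict(shape=(80, 80, 200),   vars=9,  traj=15, steps=21),
    "SupCavityFlame": dict(cells=2_988_480,       vars=12, traj=9,  steps=101),
    "ObstacleDet":    dict(cells=8_996_482,       vars=5,  traj=6,  steps=51),
    "SymmCoaxFlame":  dict(cells=294_900,         vars=21, traj=12, steps=36),
    "MultiCoaxFlame": dict(cells=13_478_444,      vars=22, traj=5,  steps=91),
    "FacadeFire":     dict(cells=2_524_423,       vars=9,  traj=9,  steps=101),
}

for name, d in DATASETS.items():
    points = d["cells"] if "cells" in d else math.prod(d["shape"])
    print(f"{name:15s} {points:>12,d} points  {d['traj'] * d['steps']:5d} snapshots")
```

The points-per-snapshot and trajectories-times-timesteps products make the scale gap between, say, IgnitHIT and MultiCoaxFlame immediately visible.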

Interactive Leaderboard

Model  |  Family  |  Case  |  Correlation  |  Test Error  |  Params (M)  |  Infer Time (s)  |  Details

* Click on any row to view detailed metrics including training and validation errors.
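The "Test Error" column in surrogate benchmarks is typically a relative L2 norm. A minimal sketch under that assumption (REALM's exact normalization is not specified on this page):

```python
import numpy as np

def relative_l2_error(pred, true, eps=1e-12):
    """Relative L2 error ||pred - true|| / ||true||, a common
    'test error' metric for neural surrogates. Whether REALM uses
    exactly this normalization is an assumption here."""
    return float(np.linalg.norm(pred - true) / (np.linalg.norm(true) + eps))

true = np.ones((64, 64))
pred = true * 1.1                      # uniform 10% overshoot
print(relative_l2_error(pred, true))   # ~0.1
```

Unlike correlation, this metric is scale-sensitive, which is why the leaderboard reports both.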

Performance Analysis

2D Regular Cases: Scatter plot showing correlation vs. inference speed. Bubble size represents test error (smaller is better), color represents parameter count. Click on bubbles to highlight corresponding models in the leaderboard.

3D Regular Cases: Performance becomes more challenging with higher dimensionality. Models show increased error and reduced correlation compared to 2D cases.

Irregular Mesh Cases: Pointwise models (especially DeepONet) show better robustness on irregular meshes compared to spectral/convolutional operators.

Multi-Dimensional Model Comparison

Radar chart comparing all 16 benchmarked models across key performance dimensions. Larger enclosed area indicates better overall trade-off.
Metrics (each normalized to 0-100%):

  • Correlation: agreement between predicted and reference trajectories
  • Accuracy: inverse of the test error
  • Inference Speed: throughput
  • Parameter Efficiency: fewer parameters scores higher
  • Memory Efficiency: lower memory footprint scores higher
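One plausible way to produce such a 0-100% scale is per-metric min-max scaling, inverted for lower-is-better quantities so that a larger radar area always means a better trade-off. The page's exact normalization scheme is not stated, so the following is a sketch:

```python
def normalize_scores(values, higher_is_better=True):
    """Min-max normalize raw metric values across models to 0-100%.
    For metrics where lower is better (error, parameter count, memory),
    the scale is inverted. Sketch of one plausible scheme; the radar
    chart's exact normalization is an assumption."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [100.0] * len(values)   # all models tie on this metric
    scores = [100.0 * (v - lo) / (hi - lo) for v in values]
    if not higher_is_better:
        scores = [100.0 - s for s in scores]
    return scores

# Example: parameter counts in millions -- fewer parameters scores higher.
print(normalize_scores([1.2, 50.0, 500.0], higher_is_better=False))
```

Min-max scaling keeps every axis in the same units, at the cost of being sensitive to outlier models at either extreme.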


Per-Case Performance Matrix

Each dot represents one model-case pair. Dot size encodes correlation (larger = higher correlation), color encodes test error (green = low error, red = high error).

Qualitative Results

2D Regular Cases: Quantitative Errors and Visual Comparisons

Representative snapshots showing OH mass fraction and velocity fields for the IgnitHIT and EvolveJet cases, plus pressure fields for PlanarDet. FFNO consistently preserves fine-scale structures with the lowest error growth.

2D Regular Results

Extended Analysis: All Models

2D Extended Results

3D Regular Cases: Quantitative Errors and Visual Comparisons

Vorticity isosurfaces for ReactTGV, temperature isosurfaces for PoolFire showing buoyancy-driven pulsation, and flame propagation structures in PropHIT. 3D cases show markedly faster error accumulation.

3D Regular Results

Extended Analysis: Correlation Evolution

3D Extended Results

Irregular Mesh Cases: Quantitative Errors and Visual Comparisons

Temperature fields for SupCavityFlame, the MultiCoaxFlame rocket combustor, and FacadeFire. DeepONet is comparatively robust on irregular meshes, while graph-based models tend to over-smooth.

Irregular Results

Extended Analysis: All Irregular Cases

Irregular Extended Results

Cross-Benchmark Model Comparison

Left: Per-category scatter plots of correlation vs. inference efficiency. Middle: Radar chart of mean performance across all cases. Right: Per-case performance summary showing train/test correlation distributions.

Summary Comparison