Journal Article

The impact of internal variability on benchmarking deep learning climate emulators

Lütjens, B., R. Ferrari, D. Watson-Parris and N. Selin (2025)
Journal of Advances in Modeling Earth Systems, 17(8) (doi: 10.1029/2024MS004619)

Abstract / Summary:

Abstract

Full-complexity Earth system models (ESMs) are computationally very expensive, limiting their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and data sets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We compare these deep learning emulators with a linear regression-based emulator, akin to pattern scaling, and show that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally resolved climate variables, notably surface temperature and precipitation. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. Precipitation is a much more noisy variable, and we show that deep learning emulators can overfit to internal variability noise at low frequencies, degrading their performance in comparison to a linear emulator. We address the issue of overfitting by increasing the number of climate simulations per emission pathway (from 3 to 50) and updating the benchmark targets with the respective ensemble averages from the MPI-ESM1.2-LR model. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based technique for emulating precipitation. We publish our code and data at https://github.com/blutjens/climate-emulator.
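To make the "linear regression-based emulator, akin to pattern scaling" concrete, the following is a minimal sketch of classic pattern scaling: one least-squares line per grid cell that maps global mean temperature to a local variable. It is not the authors' implementation (their code is at the repository linked above), and the paper's emulator may use different inputs. The function names, array shapes, and synthetic data are assumptions made for illustration.

```python
import numpy as np


def fit_pattern_scaling(global_tas, local_field):
    """Fit an independent straight line per grid cell (hypothetical helper).

    global_tas:  shape (n_years,), global mean temperature anomaly
    local_field: shape (n_years, n_lat, n_lon), local variable to emulate
    Returns (slope, intercept), each of shape (n_lat, n_lon).
    """
    n_years = global_tas.shape[0]
    # Design matrix with a slope column and an intercept column.
    X = np.column_stack([global_tas, np.ones(n_years)])
    # Flatten the grid so every cell becomes one regression target.
    Y = local_field.reshape(n_years, -1)
    coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
    grid_shape = local_field.shape[1:]
    return coef[0].reshape(grid_shape), coef[1].reshape(grid_shape)


def predict_pattern_scaling(global_tas, slope, intercept):
    """Scale the fitted spatial pattern by global mean temperature."""
    return global_tas[:, None, None] * slope + intercept


# Synthetic demonstration data (assumed, not taken from ClimateBench).
rng = np.random.default_rng(0)
global_tas = np.linspace(0.0, 3.0, 80)            # 80 years of warming
true_slope = rng.uniform(0.5, 2.0, size=(4, 8))   # local change per degree
noise = rng.normal(scale=0.3, size=(80, 4, 8))    # stand-in for variability
local = global_tas[:, None, None] * true_slope + noise

slope, intercept = fit_pattern_scaling(global_tas, local)
prediction = predict_pattern_scaling(global_tas, slope, intercept)
```

With only two parameters per grid cell, such a model has little capacity to fit noise in the training targets, which is one way to read the abstract's contrast with the 100M-parameter deep learning emulator.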

Plain Language Summary

Running a state-of-the-art climate model for a century-long future projection can take multiple weeks on the world's largest supercomputers. Emulators are approximations of climate models that quickly compute climate forecasts when running the full climate model is computationally too expensive. Our work examines how different emulation techniques can be compared with each other. We find that a simple linear regression-based emulator can forecast local temperatures and rainfall more accurately than a complex machine learning-based emulator on a commonly used benchmark data set. It is surprising that linear regression is better for local rainfall, which is expected to be more accurately emulated by nonlinear techniques. We identify that noise from natural variations in climate, called internal variability, is one reason for the comparatively good performance of linear regression on local rainfall. This implies that addressing internal variability is necessary for assessing the performance of climate emulators. Thus, we assemble a benchmark data set with reduced internal variability and, using it, show that a deep learning-based emulator can be more accurate for emulating local rainfall, while linear regression continues to be more accurate for temperature.

Key Points

Linear regression outperforms deep learning for emulating 3 out of 4 spatial atmospheric variables in the ClimateBench benchmark
Deep learning emulators can overfit unpredictable (multi-)decadal fluctuations when trained on only a few ensemble realizations
We recommend evaluating climate emulation techniques on large ensembles, such as the Em-MPI data subset with means over 50 realizations
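The recommendation to average over many realizations can be illustrated with a small sketch of how ensemble averaging suppresses internal variability. It assumes the variability is independent, zero-mean Gaussian noise across ensemble members, in which case the residual noise in the ensemble mean shrinks roughly as 1/sqrt(N). Real internal variability is correlated in time, so this is only a simplified picture; the function name and parameters are hypothetical.

```python
import numpy as np


def ensemble_mean_noise(n_members, n_years=200, sigma=1.0, seed=0):
    """Estimate the residual noise left in an ensemble mean (hypothetical helper).

    Draws independent Gaussian noise for each member and year (an assumed
    stand-in for internal variability), averages over members, and returns
    the standard deviation of that average across years.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(scale=sigma, size=(n_members, n_years))
    return noise.mean(axis=0).std()


# Compare the two ensemble sizes mentioned in the abstract (3 and 50).
for n in (3, 50):
    estimate = ensemble_mean_noise(n)
    theory = 1.0 / np.sqrt(n)
    print(f"{n:2d} members: residual noise {estimate:.3f}, 1/sqrt(N) = {theory:.3f}")
```

Under these assumptions, moving from 3 to 50 members reduces the residual noise by a factor of about sqrt(50/3), roughly 4, so benchmark targets built from 50-member means contain far less unpredictable signal for an emulator to overfit.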

Citation:

Lütjens, B., R. Ferrari, D. Watson-Parris and N. Selin (2025): The impact of internal variability on benchmarking deep learning climate emulators. Journal of Advances in Modeling Earth Systems, 17(8) (doi: 10.1029/2024MS004619) (https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2024MS004619)