Benchmarking Structural Nonlinearities with Interpretable Deep Learning
By Bayan Abusalameh
TL;DR.
Linear oscillators are simple and elegant: each mode evolves independently, superposition holds, and responses remain fully predictable. Once nonlinearities enter—whether clearance, Coulomb friction, cubic stiffness (hardening or softening), or quadratic damping—the story changes. Resonances shift, signals distort, and energy begins to leak and exchange in unexpected ways.
Our benchmark captures this transition by generating controlled SDOF simulations that span both linear and nonlinear regimes, injecting realistic noise, and labeling each sample automatically. On top of this, we train neural networks and evaluate interpretability maps to reveal not only what the model predicts, but also why.
Further Details
This post highlights the Julia-based benchmark datasets and presents only the key results. Complete simulation parameters, solver configurations, and full performance metrics (including the MATLAB-based datasets) are available in the paper.
Readers interested in the detailed equations, parameter ranges, extended results tables, and interpretability analyses can refer directly to the full paper.
1) The Linear World in 90 Seconds
Start with the single-degree-of-freedom oscillator:
$$ m\ddot{x}+c\dot{x}+kx=F(t). $$
With proportional damping, this is fully solvable:
- Modes are independent
- Resonance occurs at $\omega = \sqrt{k/m}$
- The amplitude–frequency curve is clean and symmetric
- Superposition holds
In this world, modal analysis works perfectly.
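To make this concrete, here is a minimal Python sketch (illustrative only; the benchmark datasets themselves are generated with MATLAB and Julia solvers) that integrates the SDOF equation and confirms that free vibration oscillates at $\omega=\sqrt{k/m}$:

```python
import math

def simulate_sdof(m, c, k, x0=1.0, v0=0.0, dt=1e-4, t_end=10.0, force=lambda t: 0.0):
    """Integrate m*x'' + c*x' + k*x = F(t) with semi-implicit Euler.

    Returns lists of times and displacements.
    """
    x, v, t = x0, v0, 0.0
    ts, xs = [0.0], [x0]
    while t < t_end:
        a = (force(t) - c * v - k * x) / m   # acceleration from the EOM
        v += a * dt                          # update velocity first
        x += v * dt                          # then displacement (symplectic)
        t += dt
        ts.append(t)
        xs.append(x)
    return ts, xs

# Undamped free vibration: the period should match 2*pi/sqrt(k/m).
m, k = 2.0, 50.0
ts, xs = simulate_sdof(m, c=0.0, k=k)
# Estimate the period from successive upward zero crossings.
crossings = [ts[i] for i in range(1, len(xs)) if xs[i-1] < 0.0 <= xs[i]]
period = crossings[1] - crossings[0]
print(abs(period - 2 * math.pi * math.sqrt(m / k)))  # small (discretization error)
```

The same integrator reappears unchanged for the nonlinear cases later: only the force balance gains an extra term.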
2) Where Nonlinearities Enter
In the benchmark, the nonlinear term $g_{nl}(x,\dot{x})$ takes one of five forms, each corresponding to a distinct physical mechanism:
2a. Nonlinear Force Terms
| Nonlinearity | Nonlinear term |
|---|---|
| Linear system | $g_{nl}(x,\dot{x})=0$ |
| Clearance | piecewise stiffness (see below) |
| Cubic stiffness, hardening | $g_{nl}(x)=k_cx^3,\; k_c>0$ |
| Cubic stiffness, softening | $g_{nl}(x)=k_dx^3,\; k_d<0$ |
| Coulomb friction | $g_{nl}(\dot{x})=\mu N_f \operatorname{sgn}(\dot{x})$ |
| Quadratic damping | $g_{nl}(\dot{x}) = c_{nl}\,\dot{x}\lvert\dot{x}\rvert$ |
Clearance:
$$ g_{nl}(x)= \begin{cases} k_a(x-b), & x>b\\ 0, & \lvert x\rvert \leq b\\ k_a(x+b), & x<-b \end{cases} $$
This way, the governing equation is fully defined:
$$ m\ddot{x}+c\dot{x}+kx+g_{nl}(x,\dot{x})=F(t). $$
3) Forcing: Logarithmic Sine Sweep
To excite the full frequency band, we use a logarithmic sine sweep:
$$ F(t)=F_0 \sin\!\left[2\pi f_0 T \frac{\left(\frac{f_1}{f_0}\right)^{t/T}-1}{\ln\!\left(\frac{f_1}{f_0}\right)}\right], $$
with start frequency $f_0$, end frequency $f_1$, sweep time $T$, and amplitude $F_0$.
Unlike a linear sweep, a logarithmic sweep spends equal time in each frequency octave, which is standard in vibration testing because it captures nonlinear fingerprints more clearly.
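Putting Sections 2 and 3 together, the sweep and the nonlinear terms are easy to prototype. The Python sketch below is illustrative only: the coefficients are placeholders rather than the benchmark's parameter ranges, and the production datasets use MATLAB and Julia solvers.

```python
import math

def log_sweep(t, F0=1.0, f0=1.0, f1=50.0, T=10.0):
    """Logarithmic sine sweep F(t), following the formula above."""
    r = f1 / f0
    return F0 * math.sin(2 * math.pi * f0 * T * (r ** (t / T) - 1.0) / math.log(r))

def clearance(x, v, ka=1e4, b=0.05):
    """Piecewise-linear clearance: no extra stiffness inside the gap |x| <= b."""
    if x > b:
        return ka * (x - b)
    if x < -b:
        return ka * (x + b)
    return 0.0

# One nonlinear term per class, mirroring the table in Section 2
# (coefficient values here are arbitrary placeholders).
G_NL = {
    "linear":            lambda x, v: 0.0,
    "clearance":         clearance,
    "cubic_hardening":   lambda x, v, kc=1e6: kc * x**3,
    "cubic_softening":   lambda x, v, kd=-1e6: kd * x**3,
    "coulomb_friction":  lambda x, v, muN=2.0: muN * math.copysign(1.0, v) if v else 0.0,
    "quadratic_damping": lambda x, v, cnl=5.0: cnl * v * abs(v),
}

def simulate(label, m=1.0, c=0.5, k=1e3, dt=1e-4, T=10.0):
    """Semi-implicit Euler for m*x'' + c*x' + k*x + g_nl(x, x') = F(t)."""
    g = G_NL[label]
    x = v = t = 0.0
    xs = []
    while t < T:
        a = (log_sweep(t, T=T) - c * v - k * x - g(x, v)) / m
        v += a * dt
        x += v * dt
        t += dt
        xs.append(x)
    return xs

xs = simulate("cubic_hardening", T=1.0)
print(len(xs), max(abs(u) for u in xs))
```

Swapping the `label` string swaps the physics while the solver and forcing stay fixed, which is exactly what makes the classes comparable.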
4) What It Looks Like in Data
- Linear: clean resonance, symmetric FRFs, and steady envelopes
- Nonlinear: bent resonance peaks, sidebands, distorted envelopes, and amplitude dependence
- With noise:
- ONM19, OWM19, ONJ19, OWJ19 → SNR = 15, 20, 25 dB
- All other datasets → SNR = 10, 20, 30, 40 dB
To avoid “eyeballing,” each dataset is automatically labeled according to the governing equation.
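A minimal sketch of such equation-driven labeling, assuming each sample stores the coefficients of its governing equation (the field names below are hypothetical, not the benchmark's actual schema):

```python
def label_from_equation(params):
    """Assign the class directly from the governing-equation terms.

    `params` maps coefficient names to values; the names are
    illustrative placeholders, not the benchmark's field names.
    """
    if params.get("mu_Nf", 0.0) != 0.0:
        return "coulomb_friction"
    if params.get("c_nl", 0.0) != 0.0:
        return "quadratic_damping"
    if params.get("b_gap") is not None:
        return "clearance"
    kc = params.get("k_c", 0.0)
    if kc > 0.0:
        return "cubic_hardening"
    if kc < 0.0:
        return "cubic_softening"
    return "linear"

print(label_from_equation({"k_c": 2e6}))   # cubic_hardening
print(label_from_equation({}))             # linear
```

Because the label is a pure function of the simulation parameters, there is no annotation noise: ground truth is exact by construction.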
5) The Benchmark
We built nine datasets spanning binary and six-class classification problems, with and without noise, generated in both MATLAB and Julia to ensure solver diversity.
| Dataset | Samples | Length | Classes | Notes |
|---|---|---|---|---|
| LNM1, LNJ1 | 2,500 | 100k | 2 | Linear vs nonlinear |
| DM1, DJ1 | 14,700+ | 100k | 6 | Multiple nonlinearities |
| DJ19 | 30,000 | 100k | 6 | More sweep rates |
| ONM19, OWM19, ONJ19, OWJ19 | 1,296–5,184 | 100k | 6 | Ondra et al. datasets, with/without noise |
Each sample = raw displacement time series — no FRF preprocessing.
6) Results
We evaluated performance across datasets generated with both Julia and MATLAB solvers. Here we present only the results from the Julia solvers.
- Accuracy
- Binary datasets (LNJ1) → ~100%
- Multi-class DJ1, DJ19 → ~91–99%
- Hardest datasets ONJ19, OWJ19 → 65–71%
- Confusion
- Cubic stiffness vs quadratic damping hardest to separate (both create amplitude-dependent shifts).
- Complexity
- PCA ridgeplots: binary datasets → clean separation; ONJ19, OWJ19 → heavy overlap.
7) From Accuracy to Understanding: Interpretability
Accuracy alone is not enough: we want to know whether models are learning the right physics. Below is a short tutorial on the interpretability methods used in the benchmark, followed by what they reveal.
Let a trained model $f_\theta$ take a displacement signal $x\in \mathbb{R}^T$ and output a class score for class $c$:
$$ y_c=f_\theta(x)_c. $$
The goal of interpretability is to compute a relevance map $R^{(c)}\in \mathbb{R}^T$, where each $R_t^{(c)}$ tells us how much time step $t$ contributed to predicting class $c$.
7.1 Ante-hoc vs Post-hoc
- Ante-hoc interpretability: the model is transparent by design (e.g., linear regression, decision trees). The decision process is explicit.
- Post-hoc interpretability: the model (e.g., CNN, BiLSTM) is a black box, and we compute explanations afterwards. This is what we use in the benchmark, since CNNs and BiLSTMs achieve much higher accuracy on nonlinear dynamics.
7.2 Gradient-Based Methods
The simplest relevance measure is the gradient:
$$ R_t^{(c)}= \frac{\partial y_c}{\partial x_t}, $$
which tells us how sensitive the class score $y_c$ is to changes at time $t$.
Problem: raw gradients are often noisy, and they can vanish or mislead when activations saturate.
Integrated Gradients (IG) improve stability by integrating along a path from a baseline input $x'$ to the actual input $x$:
$$ R_t^{(c)} = (x_t - x'_t) \int_0^1 \frac{\partial f_\theta\!\left(x' + \alpha (x - x')\right)_c}{\partial x_t} \, d\alpha. $$
This produces smoother and more reliable attributions.
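For a toy differentiable model, the path integral can be approximated with a Riemann sum. The sketch below (the logistic scorer is a stand-in, not one of the benchmark networks) also checks IG's completeness property, $\sum_t R_t^{(c)} = f_\theta(x)_c - f_\theta(x')_c$:

```python
import math

def model(x, w):
    """Toy class score: logistic function of a weighted sum of the signal."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

def grad(x, w):
    """Analytic gradient of the toy score w.r.t. each input sample."""
    y = model(x, w)
    return [y * (1.0 - y) * wi for wi in w]

def integrated_gradients(x, baseline, w, steps=200):
    """Riemann-sum approximation of IG along the straight-line path."""
    attrs = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps               # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point, w)
        for t in range(len(x)):
            attrs[t] += g[t] * (x[t] - baseline[t]) / steps
    return attrs

x = [0.3, -1.2, 0.8, 0.1]
baseline = [0.0] * 4
w = [1.0, 0.5, -0.7, 2.0]
R = integrated_gradients(x, baseline, w)
# Completeness: attributions sum to the change in score.
print(abs(sum(R) - (model(x, w) - model(baseline, w))))  # ~0
```

With real networks the gradient comes from autodiff rather than a closed form, but the path construction and the completeness check are identical.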
7.3 Propagation-Based Methods
Instead of derivatives, these redistribute the output back through the network.
DeepLIFT: for a layer $z=w\cdot x+b$, the change in activation is decomposed as
$$ \Delta z=\sum_{i}C_{x_i\rightarrow z},\quad C_{x_i\rightarrow z}=\frac{w_i \Delta x_i}{\sum_j w_j \Delta x_j}\, \Delta z, $$
where $\Delta x_i= x_i-x_i'$.
DeepSHAP builds on this, enforcing Shapley consistency so that contributions are fairly distributed across inputs.
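For a single linear unit the rule above can be checked numerically. Note that in this purely linear case the normalization is exact and each contribution reduces to $w_i\,\Delta x_i$ (the weights and inputs below are toy values, not benchmark weights):

```python
def deeplift_linear_contribs(w, x, x_ref):
    """Contributions C_{x_i -> z} for one linear unit z = w . x + b.

    Applies the linear rule above; the bias b cancels in the
    difference dz, and the contributions sum exactly to dz.
    """
    dx = [xi - ri for xi, ri in zip(x, x_ref)]
    dz = sum(wi * di for wi, di in zip(w, dx))
    # C_{x_i -> z} = (w_i * dx_i / sum_j w_j * dx_j) * dz
    return [(wi * di / dz) * dz for wi, di in zip(w, dx)], dz

w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, -0.3]
x_ref = [0.0, 0.0, 0.0]
C, dz = deeplift_linear_contribs(w, x, x_ref)
print(abs(sum(C) - dz))  # 0 up to rounding
```

The interesting cases are the nonlinear activations, where DeepLIFT's rules diverge from plain gradients; this linear check is only a sanity baseline.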
7.4 Occlusion Validation
To test whether relevance maps are meaningful, we use occlusion:
- Mask a time window $[t_0,t_1]$ in the signal to obtain a masked input $x^{(\text{mask})}$.
- Recompute the class score on the masked signal: $y_c(x^{(\text{mask})})$.
- Define the confidence drop: $\Delta y_c = y_c(x) - y_c(x^{(\text{mask})})$.
If $\Delta y_c$ is large, then the occluded region truly mattered — validating the interpretability method.
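These three steps need nothing more than a callable scorer. In the sketch below a hand-built score stands in for the trained network, and masking with zeros is an assumption (other fill values, such as the signal mean, work too):

```python
def occlusion_drop(score, x, t0, t1, fill=0.0):
    """Confidence drop dy_c = y_c(x) - y_c(x_masked) for window [t0, t1)."""
    x_masked = list(x)
    for t in range(t0, t1):
        x_masked[t] = fill            # replace the window with a baseline value
    return score(x) - score(x_masked)

# Toy scorer: responds only to samples 2..4, mimicking a localized feature.
score = lambda x: sum(abs(v) for v in x[2:5])

x = [0.1, 0.1, 2.0, 3.0, 2.0, 0.1]
print(occlusion_drop(score, x, 2, 5))   # 7.0: the occluded region mattered
print(occlusion_drop(score, x, 0, 2))   # 0.0: irrelevant region, no drop
```

Sliding the window across the whole signal and plotting the drops gives an occlusion map that can be compared directly against the relevance maps.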
By combining relevance maps with occlusion validation:
- Clearance: occluding impacts → big drop in confidence
- Friction: occluding stick–slip → confidence collapses
- Cubic stiffness: occluding peaks → prediction fails
- Quadratic damping: occluding high-velocity regions → class probability falls
This way, interpretability is not just visualization, but a quantitative check against physics.
8) CNN vs BiLSTM
We tested both CNNs and BiLSTMs.
- CNNs
- Higher accuracy (~41% on ONJ19)
- Confident predictions, but occasionally overconfident in wrong cases
- BiLSTMs
- Lower accuracy (~29% on ONJ19)
- More stable but underconfident predictions
Trade-off: CNNs are sharper but volatile; BiLSTMs are steadier but weaker.
9) Interpretability in Action
When applying interpretability methods:
- Best performers: DeepLIFT and DeepSHAP
- Why: They produce smooth, localized relevance maps around transients, where nonlinear signatures appear most strongly (e.g., clearance impacts, hysteresis loops, shifted resonances).
- Human vs Machine: Relevance maps align with expert physics markers (distorted FRFs, envelope correlations). This shows the networks are not just accurate—they are learning the right features.
10) Why This Benchmark Matters
- Standardized datasets for nonlinear structural dynamics
- ML-ready (time series only, no FRF preprocessing)
- Automatic labels tied directly to governing equations
- Progressive difficulty: from simple baselines to challenging noisy cases
- Interpretability validation: accuracy paired with physics-grounded explainability
This fills a gap in the community: a reproducible, transparent way to study nonlinearities with both physics and machine learning in mind.
Takeaway
Linear systems are clean, nonlinear systems are messy, and our benchmark captures this transition in a reproducible way. It provides a common ground for researchers to test, compare, and build interpretable ML methods for structural dynamics.
Accuracy alone is misleading; models must be validated against physics. Interpretability maps add transparency and trust—and the benchmark provides a reproducible way to test both, enabling fair comparisons across methods.