Benchmarking Structural Nonlinearities with Interpretable Deep Learning
By Bayan Abusalameh
TL;DR.
Linear oscillators are simple and elegant: each mode evolves independently, superposition holds, and responses remain fully predictable. Once nonlinearities enter—whether clearance, Coulomb friction, cubic stiffness (hardening or softening), or quadratic damping—the story changes. Resonances shift, signals distort, and energy begins to leak and exchange in unexpected ways.
Our benchmark captures this transition by generating controlled SDOF simulations that span both linear and nonlinear regimes, injecting realistic noise, and labeling each sample automatically. On top of this, we train neural networks and evaluate interpretability maps to reveal not only what the model predicts, but also why.
Further Details
This post highlights the Julia-based benchmark datasets and presents only the key results. Complete simulation parameters, solver configurations, and full performance metrics (including the MATLAB-based datasets) are available in the paper.
Readers interested in the detailed equations, parameter ranges, extended results tables, and interpretability analyses can refer directly to the full paper.
1) The Linear World in 90 Seconds
Start with the single-degree-of-freedom oscillator:
$$ m\ddot{x}+c\dot{x}+kx=F(t). $$
With proportional damping, this is fully solvable:
- Modes are independent
- Resonance occurs at $\omega = \sqrt{k/m}$
- The amplitude–frequency curve is clean and symmetric
- Superposition holds
In this world, modal analysis works perfectly.
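To make this concrete, here is a minimal Python sketch (illustrative only; the benchmark datasets themselves are generated with MATLAB and Julia solvers) that integrates the SDOF equation and confirms that free vibration oscillates at $\omega=\sqrt{k/m}$:

```python
import math

def simulate_sdof(m, c, k, x0=1.0, v0=0.0, dt=1e-4, t_end=10.0, force=lambda t: 0.0):
    """Integrate m*x'' + c*x' + k*x = F(t) with semi-implicit Euler.

    Returns lists of times and displacements.
    """
    x, v, t = x0, v0, 0.0
    ts, xs = [0.0], [x0]
    while t < t_end:
        a = (force(t) - c * v - k * x) / m   # acceleration from the EOM
        v += a * dt                          # update velocity first
        x += v * dt                          # then displacement (symplectic)
        t += dt
        ts.append(t)
        xs.append(x)
    return ts, xs

# Undamped free vibration: the period should match 2*pi/sqrt(k/m).
m, k = 2.0, 50.0
ts, xs = simulate_sdof(m, c=0.0, k=k)
# Estimate the period from successive upward zero crossings.
crossings = [ts[i] for i in range(1, len(xs)) if xs[i-1] < 0.0 <= xs[i]]
period = crossings[1] - crossings[0]
print(abs(period - 2 * math.pi * math.sqrt(m / k)))  # small (discretization error)
```

The same integrator reappears unchanged for the nonlinear cases later: only the force balance gains an extra term.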
2) Where Nonlinearities Enter
In the benchmark, the nonlinear term $g_{nl}(x,\dot{x})$ takes one of five forms, each corresponding to a distinct physical mechanism:
2a. Nonlinear Force Terms
| Nonlinearity | Nonlinear term |
|---|---|
| Linear system | $g_{nl}(x,\dot{x})=0$ |
| Clearance | piecewise stiffness (see below) |
| Cubic stiffness, hardening | $g_{nl}(x)=k_cx^3,\; k_c>0$ |
| Cubic stiffness, softening | $g_{nl}(x)=k_dx^3,\; k_d<0$ |
| Coulomb friction | $g_{nl}(\dot{x})=\mu N_f \operatorname{sgn}(\dot{x})$ |
| Quadratic damping | $g_{nl}(\dot{x}) = c_{nl}\,\dot{x}\lvert\dot{x}\rvert$ |
Clearance:
$$ g_{nl}(x)= \begin{cases} k_a(x-b), & x>b\\ 0, & \lvert x\rvert \leq b\\ k_a(x+b), & x<-b \end{cases} $$
This way, the governing equation is fully defined:
$$ m\ddot{x}+c\dot{x}+kx+g_{nl}(x,\dot{x})=F(t). $$
3) Forcing: Logarithmic Sine Sweep
To excite the full frequency band, we use a logarithmic sine sweep:
$$ F(t)=F_0 \sin\!\left[2\pi f_0 T \frac{\left(\frac{f_1}{f_0}\right)^{t/T}-1}{\ln\!\left(\frac{f_1}{f_0}\right)}\right], $$
with start frequency $f_0$, end frequency $f_1$, sweep time $T$, and amplitude $F_0$.
Unlike a linear sweep, a logarithmic sweep spends equal time in each frequency octave, which is standard in vibration testing because it captures nonlinear fingerprints more clearly.
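Putting Sections 2 and 3 together, the sweep and the nonlinear terms are easy to prototype. The Python sketch below is illustrative only: the coefficients are placeholders rather than the benchmark's parameter ranges, and the production datasets use MATLAB and Julia solvers.

```python
import math

def log_sweep(t, F0=1.0, f0=1.0, f1=50.0, T=10.0):
    """Logarithmic sine sweep F(t), following the formula above."""
    r = f1 / f0
    return F0 * math.sin(2 * math.pi * f0 * T * (r ** (t / T) - 1.0) / math.log(r))

def clearance(x, v, ka=1e4, b=0.05):
    """Piecewise-linear clearance: no extra stiffness inside the gap |x| <= b."""
    if x > b:
        return ka * (x - b)
    if x < -b:
        return ka * (x + b)
    return 0.0

# One nonlinear term per class, mirroring the table in Section 2
# (coefficient values here are arbitrary placeholders).
G_NL = {
    "linear":            lambda x, v: 0.0,
    "clearance":         clearance,
    "cubic_hardening":   lambda x, v, kc=1e6: kc * x**3,
    "cubic_softening":   lambda x, v, kd=-1e6: kd * x**3,
    "coulomb_friction":  lambda x, v, muN=2.0: muN * math.copysign(1.0, v) if v else 0.0,
    "quadratic_damping": lambda x, v, cnl=5.0: cnl * v * abs(v),
}

def simulate(label, m=1.0, c=0.5, k=1e3, dt=1e-4, T=10.0):
    """Semi-implicit Euler for m*x'' + c*x' + k*x + g_nl(x, x') = F(t)."""
    g = G_NL[label]
    x = v = t = 0.0
    xs = []
    while t < T:
        a = (log_sweep(t, T=T) - c * v - k * x - g(x, v)) / m
        v += a * dt
        x += v * dt
        t += dt
        xs.append(x)
    return xs

xs = simulate("cubic_hardening", T=1.0)
print(len(xs), max(abs(u) for u in xs))
```

Swapping the `label` string swaps the physics while the solver and forcing stay fixed, which is exactly what makes the classes comparable.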
4) What It Looks Like in Data
- Linear: clean resonance, symmetric FRFs, and steady envelopes
- Nonlinear: bent resonance peaks, sidebands, distorted envelopes, and amplitude dependence
- With noise:
- ONM19, OWM19, ONJ19, OWJ19 → SNR = 15, 20, 25 dB
- All other datasets → SNR = 10, 20, 30, 40 dB
To avoid “eyeballing,” each dataset is automatically labeled according to the governing equation.
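A minimal sketch of such equation-driven labeling, assuming each sample stores the coefficients of its governing equation (the field names below are hypothetical, not the benchmark's actual schema):

```python
def label_from_equation(params):
    """Assign the class directly from the governing-equation terms.

    `params` maps coefficient names to values; the names are
    illustrative placeholders, not the benchmark's field names.
    """
    if params.get("mu_Nf", 0.0) != 0.0:
        return "coulomb_friction"
    if params.get("c_nl", 0.0) != 0.0:
        return "quadratic_damping"
    if params.get("b_gap") is not None:
        return "clearance"
    kc = params.get("k_c", 0.0)
    if kc > 0.0:
        return "cubic_hardening"
    if kc < 0.0:
        return "cubic_softening"
    return "linear"

print(label_from_equation({"k_c": 2e6}))   # cubic_hardening
print(label_from_equation({}))             # linear
```

Because the label is a pure function of the simulation parameters, there is no annotation noise: ground truth is exact by construction.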
5) The Benchmark
We built nine datasets spanning binary and six-class classification problems, with and without noise, generated in both MATLAB and Julia to ensure solver diversity.
| Dataset | Samples | Length | Classes | Notes |
|---|---|---|---|---|
| LNM1, LNJ1 | 2,500 | 100k | 2 | Linear vs nonlinear |
| DM1, DJ1 | 14,700+ | 100k | 6 | Multiple nonlinearities |
| DJ19 | 30,000 | 100k | 6 | More sweep rates |
| ONM19, OWM19, ONJ19, OWJ19 | 1,296–5,184 | 100k | 6 | Ondra et al. datasets, with/without noise |
Each sample = raw displacement time series — no FRF preprocessing.
6) Results
We evaluated performance across datasets generated with both Julia and MATLAB solvers. Here we present only the results from the Julia solvers.
- Accuracy
- Binary datasets (LNJ1) → ~100%
- Multi-class DJ1, DJ19 → ~91–99%
- Hardest datasets ONJ19, OWJ19 → 65–71%
- Confusion
- Cubic stiffness vs quadratic damping hardest to separate (both create amplitude-dependent shifts).
- Complexity
- PCA ridgeplots: binary datasets → clean separation; ONJ19, OWJ19 → heavy overlap.
7) From Accuracy to Understanding: Interpretability
Accuracy alone is not enough: we want to know whether models are learning the right physics. Below is a short tutorial on the interpretability methods used in the benchmark, followed by what they reveal.
Let a trained model $f_\theta$ take a displacement signal $x\in \mathbb{R}^T$ and output a class score for class $c$:
$$ y_c=f_\theta(x)_c. $$
The goal of interpretability is to compute a relevance map $R^{(c)}\in \mathbb{R}^T$, where each $R_t^{(c)}$ tells us how much time step $t$ contributed to predicting class $c$.
7.1 Ante-hoc vs Post-hoc
- Ante-hoc interpretability: the model is transparent by design (e.g., linear regression, decision trees). The decision process is explicit.
- Post-hoc interpretability: the model (e.g., CNN, BiLSTM) is a black box, and we compute explanations afterwards. This is what we use in the benchmark, since CNNs and BiLSTMs achieve much higher accuracy on nonlinear dynamics.
7.2 Gradient-Based Methods
The simplest relevance measure is the gradient:
$$ R_t^{(c)}= \frac{\partial y_c}{\partial x_t}, $$
which tells us how sensitive the class score $y_c$ is to changes at time $t$.
Problem: raw gradients are often noisy, and they can vanish or mislead when activations saturate.
Integrated Gradients (IG) improve stability by integrating along a path from a baseline input $x'$ to the actual input $x$:
$$ R_t^{(c)} = (x_t - x'_t) \int_0^1 \frac{\partial f_\theta\!\left(x' + \alpha (x - x')\right)_c}{\partial x_t} \, d\alpha. $$
This produces smoother and more reliable attributions.
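For a toy differentiable model, the path integral can be approximated with a Riemann sum. The sketch below (the logistic scorer is a stand-in, not one of the benchmark networks) also checks IG's completeness property, $\sum_t R_t^{(c)} = f_\theta(x)_c - f_\theta(x')_c$:

```python
import math

def model(x, w):
    """Toy class score: logistic function of a weighted sum of the signal."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-s))

def grad(x, w):
    """Analytic gradient of the toy score w.r.t. each input sample."""
    y = model(x, w)
    return [y * (1.0 - y) * wi for wi in w]

def integrated_gradients(x, baseline, w, steps=200):
    """Riemann-sum approximation of IG along the straight-line path."""
    attrs = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps               # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point, w)
        for t in range(len(x)):
            attrs[t] += g[t] * (x[t] - baseline[t]) / steps
    return attrs

x = [0.3, -1.2, 0.8, 0.1]
baseline = [0.0] * 4
w = [1.0, 0.5, -0.7, 2.0]
R = integrated_gradients(x, baseline, w)
# Completeness: attributions sum to the change in score.
print(abs(sum(R) - (model(x, w) - model(baseline, w))))  # ~0
```

With real networks the gradient comes from autodiff rather than a closed form, but the path construction and the completeness check are identical.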
7.3 Propagation-Based Methods
Instead of derivatives, these redistribute the output back through the network.
DeepLIFT: for a layer $z=w\cdot x+b$, the change in activation is decomposed as
$$ \Delta z=\sum_{i}C_{x_i\rightarrow z},\quad C_{x_i\rightarrow z}=\frac{w_i \Delta x_i}{\sum_j w_j \Delta x_j}\, \Delta z, $$
where $\Delta x_i= x_i-x_i'$.
DeepSHAP builds on this, enforcing Shapley consistency so that contributions are fairly distributed across inputs.
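For a single linear unit the rule above can be checked numerically. Note that in this purely linear case the normalization is exact and each contribution reduces to $w_i\,\Delta x_i$ (the weights and inputs below are toy values, not benchmark weights):

```python
def deeplift_linear_contribs(w, x, x_ref):
    """Contributions C_{x_i -> z} for one linear unit z = w . x + b.

    Applies the linear rule above; the bias b cancels in the
    difference dz, and the contributions sum exactly to dz.
    """
    dx = [xi - ri for xi, ri in zip(x, x_ref)]
    dz = sum(wi * di for wi, di in zip(w, dx))
    # C_{x_i -> z} = (w_i * dx_i / sum_j w_j * dx_j) * dz
    return [(wi * di / dz) * dz for wi, di in zip(w, dx)], dz

w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, -0.3]
x_ref = [0.0, 0.0, 0.0]
C, dz = deeplift_linear_contribs(w, x, x_ref)
print(abs(sum(C) - dz))  # 0 up to rounding
```

The interesting cases are the nonlinear activations, where DeepLIFT's rules diverge from plain gradients; this linear check is only a sanity baseline.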
7.4 Occlusion Validation
To test whether relevance maps are meaningful, we use occlusion:
- Mask a time window $[t_0,t_1]$ in the signal to obtain a masked input $x^{(\text{mask})}$.
- Recompute the class score on the masked signal: $y_c(x^{(\text{mask})})$.
- Define the confidence drop: $\Delta y_c = y_c(x) - y_c(x^{(\text{mask})})$.
If $\Delta y_c$ is large, then the occluded region truly mattered — validating the interpretability method.
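These three steps need nothing more than a callable scorer. In the sketch below a hand-built score stands in for the trained network, and masking with zeros is an assumption (other fill values, such as the signal mean, work too):

```python
def occlusion_drop(score, x, t0, t1, fill=0.0):
    """Confidence drop dy_c = y_c(x) - y_c(x_masked) for window [t0, t1)."""
    x_masked = list(x)
    for t in range(t0, t1):
        x_masked[t] = fill            # replace the window with a baseline value
    return score(x) - score(x_masked)

# Toy scorer: responds only to samples 2..4, mimicking a localized feature.
score = lambda x: sum(abs(v) for v in x[2:5])

x = [0.1, 0.1, 2.0, 3.0, 2.0, 0.1]
print(occlusion_drop(score, x, 2, 5))   # 7.0: the occluded region mattered
print(occlusion_drop(score, x, 0, 2))   # 0.0: irrelevant region, no drop
```

Sliding the window across the whole signal and plotting the drops gives an occlusion map that can be compared directly against the relevance maps.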
By combining relevance maps with occlusion validation:
- Clearance: occluding impacts → big drop in confidence
- Friction: occluding stick–slip → confidence collapses
- Cubic stiffness: occluding peaks → prediction fails
- Quadratic damping: occluding high-velocity regions → class probability falls
This way, interpretability is not just visualization, but a quantitative check against physics.
8) CNN vs BiLSTM
We tested both CNNs and BiLSTMs.
- CNNs
- Higher accuracy (~41% on ONJ19)
- Confident predictions, but occasionally overconfident in wrong cases
- BiLSTMs
- Lower accuracy (~29% on ONJ19)
- More stable but underconfident predictions
Trade-off: CNNs are sharper but volatile; BiLSTMs are steadier but weaker.
9) Interpretability in Action
When applying interpretability methods:
- Best performers: DeepLIFT and DeepSHAP
- Why: They produce smooth, localized relevance maps around transients, where nonlinear signatures appear most strongly (e.g., clearance impacts, hysteresis loops, shifted resonances).
- Human vs Machine: Relevance maps align with expert physics markers (distorted FRFs, envelope correlations). This shows the networks are not just accurate—they are learning the right features.
10) Why This Benchmark Matters
- Standardized datasets for nonlinear structural dynamics
- ML-ready (time series only, no FRF preprocessing)
- Automatic labels tied directly to governing equations
- Progressive difficulty: from simple baselines to challenging noisy cases
- Interpretability validation: accuracy paired with physics-grounded explainability
This fills a gap in the community: a reproducible, transparent way to study nonlinearities with both physics and machine learning in mind.
Takeaway
Linear systems are clean, nonlinear systems are messy, and our benchmark captures this transition in a reproducible way. It provides a common ground for researchers to test, compare, and build interpretable ML methods for structural dynamics.
Accuracy alone is misleading; models must be validated against physics. Interpretability maps add transparency and trust—and the benchmark provides a reproducible way to test both, enabling fair comparisons across methods.