From Decoupled Linear Modes to Nonlinear Mode Interaction

Posted on Sep 8, 2025

Why superposition breaks, energy starts to slosh, and what to look for in data

TL;DR. In the linear regime, each mode is an independent damped oscillator—clean, decoupled, and predictable. As amplitudes grow (or when damping/forcing aren’t “nice”), cross terms re-couple the modal equations. Near internal resonance (e.g., 1:1, 2:1, 3:1 ratios), those cross terms become near-resonant drives, and you see energy exchange, sidebands, and new frequencies. That’s “mode interaction.” Our benchmark builds controlled simulations that span both regimes and labels interaction strength automatically.


I) The linear world in 90 seconds

Start with a standard MDOF system in physical coordinates:

$$M\ddot{x} + C\dot{x} + Kx = f(t),$$

solve the generalized eigenproblem

$$K\phi_i = \omega_i^2 M\phi_i,$$

and use mass-normalized modes $\mathbf{\Phi} = [\phi_1 \ldots \phi_n]$ so

$$\mathbf{\Phi}^\top \mathbf{M} \mathbf{\Phi} = \mathbf{I}, \qquad \mathbf{\Phi}^\top \mathbf{K} \mathbf{\Phi} = \mathbf{\Omega}^2.$$

With the modal transform $\mathbf{x} = \mathbf{\Phi}\mathbf{q}$ and proportional (Rayleigh) damping $\mathbf{C} = \alpha\mathbf{M} + \beta\mathbf{K}$, the linear part decouples:

$$\ddot{q}_i + 2\zeta_i\omega_i\dot{q}_i + \omega_i^2 q_i = \Gamma_i f(t).$$

Interpretation: each $q_i$ is its own grounded SDOF oscillator. Superposition holds. No interaction.
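To make the decoupling concrete, here is a minimal numpy sketch (a toy 2-DOF system with made-up matrices, not from the text) that solves the generalized eigenproblem by Cholesky reduction and checks the two modal identities above:

```python
import numpy as np

# Toy 2-DOF system (illustrative numbers only)
M = np.diag([1.0, 2.0])
K = np.array([[300.0, -100.0],
              [-100.0, 100.0]])

# Reduce K phi = omega^2 M phi to a standard symmetric eigenproblem
# via the Cholesky factor of M, then map eigenvectors back.
L = np.linalg.cholesky(M)
Linv = np.linalg.inv(L)
A = Linv @ K @ Linv.T              # symmetric matrix with the same eigenvalues
lam, V = np.linalg.eigh(A)
omega = np.sqrt(lam)               # natural frequencies
Phi = Linv.T @ V                   # mass-normalized mode shapes
```

By construction $\Phi^\top M \Phi = V^\top V = I$ and $\Phi^\top K \Phi = \mathrm{diag}(\omega_i^2)$, which is exactly the normalization used in the text.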


II) Where does “mode interaction” come from?

As soon as anything non-diagonal in modal space appears, the modes re-couple:

A) Nonlinear restoring/damping (most common)

Compact 2-DOF modal:

$$\ddot{q}_i + 2\zeta_i\omega_i\dot{q}_i + \omega_i^2 q_i + \sum_{j,k} a_{ijk} q_j q_k + \sum_{j,k,\ell} b_{ijk\ell} q_j q_k q_\ell = \Gamma_i f(t), \quad i = 1, 2.$$

Terms like $a_{112}q_1 q_2$, $b_{1122}q_1 q_2^2$ are cross-terms. They vanish in the linear limit, but at finite amplitude they couple the modal equations.

B) Damping that isn’t proportional

If $\mathbf{C}$ isn’t Rayleigh, the projected $\mathbf{C}_m = \mathbf{\Phi}^\top \mathbf{C} \mathbf{\Phi}$ has off-diagonals → linear coupling even without nonlinearities.

C) Forcing/parametric effects

Multiple inputs, non-collocated actuation/sensing, time-varying stiffness, gyroscopic effects—all can couple modes.


III) Why internal resonance supercharges interaction

Consider the quadratic coupling and a 2:1 ratio $\omega_1 \approx 2\omega_2$. If $q_2 \approx A_2\cos(\omega_2 t)$, then

$$q_2^2 = \frac{A_2^2}{2}\left[1 + \cos(2\omega_2 t)\right].$$

The $\cos(2\omega_2 t)$ term drives mode 1 at frequency $2\omega_2$. When $\omega_1 \approx 2\omega_2$, that drive is near-resonant, so $q_1$ grows and draws energy from $q_2$. Symmetrically, terms like $q_1 q_2$ contain components near $\omega_2$ (since $\omega_1 - \omega_2 \approx \omega_2$), feeding energy back. Net effect: periodic energy exchange (beating)—the hallmark of interaction.

Selection rules (handy intuition):

  • Quadratic terms → strong effects at 1:1 and 2:1, and create sum/difference lines.
  • Cubic terms → 1:1 backbone bending, 3:1 interactions, and combination tones.

IV) What it looks like in data

Time domain (energies).

Modal energies $E_i(t) = \frac{1}{2}(\dot{q}_i^2 + \omega_i^2 q_i^2)$.

  • Interacting: envelopes of $E_1, E_2$ swing out of phase (phase difference $\approx \pi$), i.e., energy sloshing.
  • Non-interacting: envelopes flat/monotone.

Frequency domain.

  • Sidebands/combination lines: components at $\omega_i \pm \omega_j$, sub/superharmonics.
  • Bicoherence: large $b^2(\omega_1, \omega_2)$ shows quadratic phase coupling.

Practical tip: you don’t need to eyeball—compute an Energy-Transfer Index (out-of-phase envelope correlation), a Sideband/Combination Score, and bicoherence. Two out of three high → interacting.


V) Common myths

  • “Modal analysis means modes never interact.” Only true for the linear part with proportional damping. Nonlinearities (or non-proportional damping/forcing) re-couple them.
  • “You need exact frequency ratios to see interaction.” Near-commensurate is enough. Detuning just sets how strong/clean the energy exchange is.
  • “You can spot interaction from time traces alone.” Sometimes, but robust diagnostics (energy envelopes, spectral features, bicoherence) are far more reliable—especially with noise.

VI) Bridging to our benchmark (what we’ll publish next)

We build a large, labeled dataset that spans both regimes:

  • Linear/near-linear cases where modes stay independent.
  • Weak/strong/deep nonlinear cases with cross-terms and near-commensurate ratios (1:1, 2:1, 3:1) so interactions actually manifest.
  • Multiple excitations (harmonic, two-tone, chirp, impulse, band-limited noise), amplitudes, and SNRs.

Automatic labels (no eyeballing):

  • ETI (energy exchange), SCS (sidebands/combination lines), BIC (bicoherence).
  • Thresholds decide non / weak / strong / deep interaction per sample.

This lets you: (1) study physics transparently, and (2) train/evaluate ML models on a clean, reproducible benchmark.


Appendix — Derivations & Worked Details

A. From Physical EOM to Modal Coordinates

A.1 Physical equations (MCK form)

$$\mathbf{M}\ddot{\mathbf{x}}(t) + \mathbf{C}\dot{\mathbf{x}}(t) + \mathbf{K}\mathbf{x}(t) + \mathbf{g}(\mathbf{x}, \dot{\mathbf{x}}) = \mathbf{f}(t), \qquad \mathbf{x} \in \mathbb{R}^n.$$

A.2 Modes and normalization

Solve the generalized eigenproblem

$$\mathbf{K}\phi_i = \omega_i^2 \mathbf{M}\phi_i, \quad i = 1, \ldots, n,$$

stack $\mathbf{\Phi} = [\phi_1 \ldots \phi_n]$, and mass-normalize:

$$\mathbf{\Phi}^\top \mathbf{M} \mathbf{\Phi} = \mathbf{I}, \qquad \mathbf{\Phi}^\top \mathbf{K} \mathbf{\Phi} = \mathbf{\Omega}^2 = \mathrm{diag}(\omega_1^2, \ldots, \omega_n^2).$$

A.3 Modal transform and projection

With $\mathbf{x} = \mathbf{\Phi}\mathbf{q}$ (so $\dot{\mathbf{x}} = \mathbf{\Phi}\dot{\mathbf{q}}$, $\ddot{\mathbf{x}} = \mathbf{\Phi}\ddot{\mathbf{q}}$):

$$\underbrace{\mathbf{\Phi}^\top \mathbf{M} \mathbf{\Phi}}_{\mathbf{I}}\ddot{\mathbf{q}} + \underbrace{\mathbf{\Phi}^\top \mathbf{C} \mathbf{\Phi}}_{\mathbf{C}_m}\dot{\mathbf{q}} + \underbrace{\mathbf{\Phi}^\top \mathbf{K} \mathbf{\Phi}}_{\mathbf{\Omega}^2}\mathbf{q} + \underbrace{\mathbf{\Phi}^\top \mathbf{g}(\mathbf{\Phi}\mathbf{q}, \mathbf{\Phi}\dot{\mathbf{q}})}_{\mathbf{N}(\mathbf{q},\dot{\mathbf{q}})} = \underbrace{\mathbf{\Phi}^\top \mathbf{f}(t)}_{\mathbf{\Gamma} f(t)}$$

A.4 Damping

If $\mathbf{C} = \alpha\mathbf{M} + \beta\mathbf{K}$ (Rayleigh),

$$\mathbf{C}_m = \alpha\mathbf{I} + \beta\mathbf{\Omega}^2 = 2\mathbf{\Xi}\mathbf{\Omega}, \quad \zeta_i = \frac{\alpha}{2\omega_i} + \frac{\beta\omega_i}{2}.$$

The linear part decouples mode-by-mode:

$$\ddot{q}_i + 2\zeta_i\omega_i\dot{q}_i + \omega_i^2 q_i + N_i(\mathbf{q}, \dot{\mathbf{q}}) = \Gamma_i f(t).$$

(If damping is non-proportional, $\mathbf{C}_m$ has off-diagonals → linear coupling remains.)

A.5 Forcing map and participation

If $\mathbf{f}(t) = \mathbf{B}u(t)$, then modal forcing is

$$\mathbf{\Phi}^\top \mathbf{f}(t) = \mathbf{\Gamma}u(t), \quad \mathbf{\Gamma} = \mathbf{\Phi}^\top \mathbf{B},$$

whose entries $\Gamma_i$ measure how strongly the input excites each mode.

A.6 Where the modal nonlinear coefficients come from

If $\mathbf{g}$ admits a local polynomial expansion,

$$\mathbf{g}(\mathbf{x}, \dot{\mathbf{x}}) \approx \mathbf{H}^{(2)}:(\mathbf{x} \otimes \mathbf{x}) + \mathbf{H}^{(3)}\vdots(\mathbf{x} \otimes \mathbf{x} \otimes \mathbf{x}) + \ldots,$$

then with $\mathbf{x} = \mathbf{\Phi}\mathbf{q}$,

$$a_{ijk} = \phi_i^\top \mathbf{H}^{(2)}:(\phi_j \otimes \phi_k), \qquad b_{ijk\ell} = \phi_i^\top \mathbf{H}^{(3)}\vdots(\phi_j \otimes \phi_k \otimes \phi_\ell).$$

B. How Nonlinear Cross Terms Create Mode Interaction

Consider a 2-DOF modal model with quadratic/cubic terms:

$$\ddot{q}_i + 2\zeta_i\omega_i\dot{q}_i + \omega_i^2 q_i + \sum_{j,k} a_{ijk} q_j q_k + \sum_{j,k,\ell} b_{ijk\ell} q_j q_k q_\ell = \Gamma_i f(t), \quad i = 1, 2.$$

B.1 Frequency content via product-to-sum

For a quadratic cross term $q_1 q_2$ with $q_j \approx A_j\cos(\omega_j t)$,

$$q_1 q_2 = \frac{A_1 A_2}{2}\left[\cos((\omega_1 - \omega_2)t) + \cos((\omega_1 + \omega_2)t)\right].$$

→ drives components at $\omega_1 \pm \omega_2$ (sum/difference lines).

For a quadratic self term $q_2^2$,

$$q_2^2 = \frac{A_2^2}{2}\left[1 + \cos(2\omega_2 t)\right],$$

→ DC + component at $2\omega_2$.

B.2 Internal resonance “selection rules”

If $\omega_1 \approx 2\omega_2$ (2:1),

  • The $\cos(2\omega_2 t)$ in $q_2^2$ is near $\omega_1$ → it resonantly drives $q_1$.
  • The $q_1 q_2$ term contains a component near $\omega_2$ (since $\omega_1 - \omega_2 \approx \omega_2$) → feeds back into $q_2$.

Result: energy exchange between modes (beating).

B.3 Slow-flow (averaged) equations for 2:1 case (structure)

Assume small nonlinearity $\varepsilon$ and detuning $\omega_1 = 2\omega_2 + \varepsilon\sigma$. Write

$$q_1 = A_1(T)\cos(\omega_1 t + \phi_1(T)), \quad q_2 = A_2(T)\cos(\omega_2 t + \phi_2(T)), \quad T = \varepsilon t.$$

Averaging/multiple-scales yields the modulation system (schematic form):

$$\dot{A}_1 = -\zeta_1\omega_1 A_1 + \kappa_1 A_2^2 \sin\psi + \text{(drive terms)},$$

$$\dot{A}_2 = -\zeta_2\omega_2 A_2 + \kappa_2 A_1 A_2 \sin\psi + \text{(drive terms)},$$

$$\dot{\psi} = \sigma + \lambda_1 A_1^2 + \lambda_2 A_2^2 + \kappa_3\!\left(\frac{2A_1}{A_2} - \frac{A_2}{A_1}\right)\cos\psi + \text{(drive terms)},$$

with phase combination $\psi = 2\phi_2 - \phi_1$. The constants $\kappa_i, \lambda_i$ are algebraic functions of the nonlinear coefficients $a_{ijk}$, $b_{ijk\ell}$ and $\omega_i$.

Key takeaway: the $\sin\psi$ coupling terms are the engine of energy exchange.

(You can specialize to other ratios, e.g., 1:1 or 3:1, and get analogous slow flows with the appropriate phase combinations.)


C. Worked Mini-Example (No Numbers Needed)

Take the unforced, lightly damped pair:

$$\ddot{q}_1 + \omega_1^2 q_1 + \alpha q_2^2 = 0, \qquad \ddot{q}_2 + \omega_2^2 q_2 + \beta q_1 q_2 = 0, \qquad \omega_1 \approx 2\omega_2.$$

Let $q_2 \approx A_2\cos(\omega_2 t)$. Then $q_2^2$ contains $\cos(2\omega_2 t)$. If $\omega_1 = 2\omega_2$, the $q_1$-equation is a resonantly forced SDOF:

$$\ddot{q}_1 + \omega_1^2 q_1 \approx -\frac{\alpha A_2^2}{2}\cos(2\omega_2 t) \quad \text{(near resonance)}.$$

Thus $q_1$ grows (limited by damping and higher-order terms), which in turn feeds back through $\beta q_1 q_2$ to modulate $q_2$. That mutual drive is the mechanism of mode interaction.
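The mini-example is easy to integrate numerically. A hand-rolled RK4 sketch (illustrative values $\omega_1 = 2$, $\omega_2 = 1$, $\alpha = \beta = 0.1$, chosen for this demo) shows mode 1 starting at rest and being pumped up by the $2\omega_2$ component of $q_2^2$:

```python
import numpy as np

# Illustrative parameters (not from the text): omega_1 = 2*omega_2, weak coupling
w1, w2 = 2.0, 1.0
alpha, beta = 0.1, 0.1

def rhs(y):
    q1, v1, q2, v2 = y
    return np.array([v1, -w1**2 * q1 - alpha * q2**2,
                     v2, -w2**2 * q2 - beta * q1 * q2])

# Hand-rolled RK4; all the energy starts in mode 2
y = np.array([0.0, 0.0, 1.0, 0.0])
dt, steps = 0.01, 6000
Q1 = np.empty(steps)
for i in range(steps):
    k1 = rhs(y)
    k2 = rhs(y + 0.5 * dt * k1)
    k3 = rhs(y + 0.5 * dt * k2)
    k4 = rhs(y + dt * k3)
    y = y + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    Q1[i] = y[0]

early = np.abs(Q1[:300]).max()    # first 3 s: mode 1 barely moving
later = np.abs(Q1[2000:]).max()   # after 20 s: pumped up by the 2*omega_2 drive
```

The envelope of $q_1$ grows well beyond its early amplitude and then sloshes back, rather than running away, because the total energy is bounded.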


D. Why Linear Modes “Decouple” Yet Interaction Appears

  • The modal transform diagonalizes the linear part (when damping is proportional).
  • Nonlinear terms $N_i(\mathbf{q}, \dot{\mathbf{q}})$ are not diagonal in modal space → they re-couple the modal equations.
  • At small amplitude, those terms are weak; near internal resonance, they become near-resonant and dominate the slow dynamics.
  • In strongly nonlinear regimes, the true vibration shapes are amplitude-dependent (Nonlinear Normal Modes); linear modes are just a convenient basis.

E. Diagnostics We’ll Use Later (with precise math)

E.1 Energy-Transfer Index (ETI)

Bandpass each mode around $\omega_k$ ($\pm 5\text{–}10\%$), compute the analytic signal $z_k = q_k^{bp} + i\mathcal{H}\{q_k^{bp}\}$, the envelope $A_k = |z_k|$, and the modal energy $E_k = A_k^2$. Define the most negative sliding correlation:

$$\text{ETI}_{12} = \min_{t \in \text{windows}} \mathrm{corr}(E_1(t), E_2(t)).$$

Interpretation: strong anti-phase envelope correlation (out of phase by $\pi$ rad; e.g., $\text{ETI}_{12} \leq -0.5$) $\Rightarrow$ energy exchange.
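A minimal numpy sketch of the ETI, assuming $q_1, q_2$ are already the bandpassed modal coordinates; the analytic signal is built with an FFT-based Hilbert transform (a numpy-only stand-in for scipy.signal.hilbert), and the window/hop sizes are arbitrary choices:

```python
import numpy as np

def analytic_signal(x):
    """Hilbert analytic signal via FFT (numpy-only stand-in for scipy.signal.hilbert)."""
    N = len(x)
    X = np.fft.fft(x)
    h = np.zeros(N)
    h[0] = 1.0
    h[1:(N + 1) // 2] = 2.0       # double positive frequencies
    if N % 2 == 0:
        h[N // 2] = 1.0           # Nyquist bin kept once
    return np.fft.ifft(X * h)

def eti(q1, q2, win=256, hop=64):
    """Most negative sliding correlation between the two envelope energies."""
    E1 = np.abs(analytic_signal(q1)) ** 2
    E2 = np.abs(analytic_signal(q2)) ** 2
    scores = []
    for s in range(0, len(E1) - win, hop):
        a, b = E1[s:s + win], E2[s:s + win]
        if a.std() > 1e-12 and b.std() > 1e-12:
            scores.append(np.corrcoef(a, b)[0, 1])
    return min(scores)

# Synthetic "sloshing" pair: anti-phase envelopes riding on two carriers
t = np.arange(0, 20, 0.01)
q1 = np.sin(0.5 * t) ** 2 * np.cos(10 * t)
q2 = np.cos(0.5 * t) ** 2 * np.cos(25 * t)
```

For these anti-phase envelopes the index drops well below the $-0.5$ rule of thumb.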

E.2 Sideband / Combination Score (SCS)

Compute STFT power $P(\omega, t)$. Around modal lines and expected nonlinear lines $\nu \in \{\omega_i \pm \omega_j,\, 2\omega_j,\, \frac{1}{2}\omega_{exc}, \ldots\}$, define

$$\text{SCS} = \max_\nu \frac{\mathrm{median}_t\, P(\nu, t)}{\max\!\left(\mathrm{median}_t\, P(\omega_\text{main}, t),\; \mathrm{median}_{\omega,t}\, P(\omega, t)\right)}.$$

Prominent nonlinear lines ($\geq 10\text{–}20\%$ of main) $\Rightarrow$ interaction.

E.3 Bicoherence (quadratic phase coupling)

With DFT $X(\omega)$ and segment averaging $\mathbb{E}[\cdot]$,

$$b^2(\omega_1, \omega_2) = \frac{\left|\mathbb{E}\{X(\omega_1)X(\omega_2)X^*(\omega_1+\omega_2)\}\right|^2}{\mathbb{E}\{|X(\omega_1)|^2\}\,\mathbb{E}\{|X(\omega_2)|^2\}\,\mathbb{E}\{|X(\omega_1+\omega_2)|^2\}}.$$

High $b^2$ at relevant pairs $\Rightarrow$ strong quadratic coupling (e.g., 2:1, sum lines).
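The estimator is a few lines of numpy. The demo signals below are synthetic: when the line at $\omega_1 + \omega_2$ carries phase $\phi_1 + \phi_2$, the triple product stays coherent across segments and $b^2 \to 1$; with an independent phase it averages out:

```python
import numpy as np

def bicoherence(segments, k1, k2):
    """Segment-averaged bicoherence b^2 at FFT bins (k1, k2), per the formula above."""
    X = np.fft.fft(segments, axis=1)
    num = np.abs(np.mean(X[:, k1] * X[:, k2] * np.conj(X[:, k1 + k2]))) ** 2
    den = (np.mean(np.abs(X[:, k1]) ** 2)
           * np.mean(np.abs(X[:, k2]) ** 2)
           * np.mean(np.abs(X[:, k1 + k2]) ** 2))
    return num / den

# Quadratically coupled vs. uncoupled test signals on exact FFT bins
rng = np.random.default_rng(0)
N = 128
n = np.arange(N)
coupled, uncoupled = [], []
for _ in range(64):
    p1, p2, p3 = rng.uniform(0, 2 * np.pi, 3)
    base = np.cos(2*np.pi*10*n/N + p1) + np.cos(2*np.pi*17*n/N + p2)
    coupled.append(base + np.cos(2*np.pi*27*n/N + p1 + p2))   # phase-locked line
    uncoupled.append(base + np.cos(2*np.pi*27*n/N + p3))      # independent phase

b2_coupled = bicoherence(np.array(coupled), 10, 17)
b2_uncoupled = bicoherence(np.array(uncoupled), 10, 17)
```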

E.4 Decision (no eyeballing)

All three above threshold $\Rightarrow$ deep; two out of three $\Rightarrow$ strong; one $\Rightarrow$ weak; none $\Rightarrow$ non.


F. Grounded vs Non-Grounded (clarifying the model)

In modal coordinates with mass normalization, each $q_i$ has a restoring term $\omega_i^2 q_i$: it’s a grounded oscillator.

Interaction does not require non-grounded (free-free) DOFs; it comes from the nonlinear cross terms (and possibly off-diagonal damping/forcing), as shown above.

Interpretability 101: How Do We Know What a Neural Network Is Looking At?

A technical walkthrough of attribution methods and how to evaluate them — assuming some comfort with math, but no prior ML background required.


The Problem: Black Boxes

Imagine you go to a doctor and they say: “You need surgery.”

You ask: “Why?”

They say: “I don’t know. That’s just what the model predicted.”

That’s the black box problem. A neural network takes in data, executes hundreds of thousands of nonlinear transformations, and produces a confident output — but the path from input to output is, for practical purposes, opaque. Even the engineers who built the model often can’t tell you which parts of the input drove a specific prediction.

Interpretability is the subfield of machine learning that tries to open that box. The core question it asks is: “Given a trained model and a specific input, which input features were most responsible for the output?”

The tools it uses to answer this are called attribution methods — and evaluating whether those tools actually work is harder than it sounds.


Running Example: Time Series Classification

To make everything concrete, we’ll work with a single example throughout. Suppose you have a model that watches a sensor signal over 10 time steps and classifies it as either normal or about to fail.

Time steps:  1    2    3    4    5    6    7    8    9    10
Signal x:   [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2]

The model outputs: “failure — 95% confidence.”

Now we want to know: which time steps drove that prediction? That’s the job of an attribution method. It produces a relevance score for each time step:

Relevance r: [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2]

Higher score = the model cared more about that time step. But how did we arrive at those scores? And how do we know if they’re telling the truth? Let’s go through each.


Part 1: How Attribution Maps Are Computed

Method 1 — Vanilla Gradient (Sensitivity Analysis)

Core idea: Compute the partial derivative of the model’s output with respect to each input feature. Large derivative = that feature has high leverage on the output.

$$ \phi_t = \left|\frac{\partial F(x)}{\partial x_t}\right| $$

where $F(x)$ is the model’s scalar output (e.g., the probability assigned to the “failure” class) and $x_t$ is the value at time step $t$.

This is computed in a single backward pass through the network — computationally cheap. The resulting map tells you: “If I nudge $x_t$ by a small $\epsilon$, the output changes by approximately $\phi_t \cdot \epsilon$.”

The saturation problem: Gradients measure local sensitivity — they only describe what happens at the exact point $x$. If the model is “confident but flat” in a region (i.e., the function has leveled off), the gradient will be near zero even for features the model clearly relies on. This is the gradient saturation problem, and it’s why vanilla gradients tend to produce noisy, unreliable maps in practice.
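A toy illustration with a hand-written logistic "model" (the weight vector w is made up, not a trained network), including the saturation effect: scaling the input up makes the model more confident while the gradient map collapses:

```python
import numpy as np

# Hypothetical stand-in model: logistic score with fixed, made-up weights
w = np.array([0.0, 0.1, 1.5, 2.0, 1.0, 0.2, 0.5, 0.3, 0.6, 0.1])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
F = lambda x: sigmoid(w @ x)

x = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])

# Vanilla gradient map: phi_t = |dF/dx_t| = sigma'(w.x) * |w_t| for this model
z = w @ x
phi = sigmoid(z) * (1 - sigmoid(z)) * np.abs(w)

# Saturation: at 10x amplitude the model is "confident but flat",
# so the gradient map collapses even though the same features decide
z_big = w @ (10 * x)
phi_big = sigmoid(z_big) * (1 - sigmoid(z_big)) * np.abs(w)
```

Here the largest attribution lands on the step with the largest weight, but the same map evaluated at the saturated input is numerically zero everywhere.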


Method 2 — Integrated Gradients

Core idea: Instead of evaluating the gradient at a single point, integrate it along a straight-line path from a reference baseline $x'$ (typically the zero vector) to the actual input $x$.

$$ \phi_t^{IG} = (x_t - x'_t) \times \int_0^1 \frac{\partial F\!\left(x' + \alpha(x - x')\right)}{\partial x_t} \, d\alpha $$

In practice, you approximate this integral by summing gradients at $m$ evenly-spaced interpolation steps (typically $m = 50$ to $300$):

$$ \phi_t^{IG} \approx (x_t - x'_t) \times \frac{1}{m} \sum_{k=1}^{m} \frac{\partial F\!\left(x' + \frac{k}{m}(x - x')\right)}{\partial x_t} $$

Why this matters — the Completeness Axiom: Integrated Gradients satisfies a property called completeness (also called the efficiency axiom):

$$ \sum_{t=1}^{T} \phi_t^{IG} = F(x) - F(x') $$

The attributions sum to exactly the difference between the model’s output on the actual input and its output on the baseline. This gives you a fixed “budget” of credit — every unit of importance assigned to one time step is taken from somewhere else. There’s no free lunch: a method that satisfies completeness can’t inflate scores arbitrarily.

This also fixes the saturation problem: even if the gradient is flat at $x$, the integral over the path from $x'$ to $x$ accumulates the signal through the parts of the path where the gradient was active.

Choice of baseline: The zero vector is common, but not always meaningful. A better choice is often the mean of the training set, or random samples — the baseline should represent a “neutral” input with no class signal.
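Completeness is easy to verify numerically. A sketch on a toy logistic scorer (hypothetical weights, not a trained network), with a zero baseline and a midpoint-rule approximation of the path integral:

```python
import numpy as np

# Hypothetical stand-in model: logistic score with fixed, made-up weights
w = np.array([0.0, 0.1, 1.5, 2.0, 1.0, 0.2, 0.5, 0.3, 0.6, 0.1])
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
F = lambda x: sigmoid(w @ x)
grad_F = lambda x: sigmoid(w @ x) * (1 - sigmoid(w @ x)) * w   # analytic dF/dx

x  = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])
x0 = np.zeros_like(x)                 # zero baseline

# Midpoint-rule approximation of the path integral with m steps
m = 300
alphas = (np.arange(m) + 0.5) / m
avg_grad = np.mean([grad_F(x0 + a * (x - x0)) for a in alphas], axis=0)
phi_ig = (x - x0) * avg_grad

# Completeness: attributions should sum to F(x) - F(x0)
completeness_gap = abs(phi_ig.sum() - (F(x) - F(x0)))
```

With a smooth model and $m = 300$, the gap between $\sum_t \phi_t^{IG}$ and $F(x) - F(x')$ is far below any practical tolerance.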


Method 3 — Grad-CAM

Core idea: Instead of backpropagating all the way to the raw input, stop at the last convolutional (or recurrent) layer’s feature maps. These are richer, more semantically meaningful representations than raw pixel/time-step values.

Let $A^k \in \mathbb{R}^{T}$ be the $k$-th feature map of the last layer (for our 1D time series case). Grad-CAM computes:

Step 1 — Compute channel importance weights by global average pooling the gradients:

$$ \alpha_k = \frac{1}{T} \sum_{t=1}^{T} \frac{\partial F(x)}{\partial A^k_t} $$

Step 2 — Produce the localization map as a ReLU-weighted sum of feature maps:

$$ L^{\text{Grad-CAM}} = \text{ReLU}\!\left(\sum_k \alpha_k A^k\right) $$

The ReLU keeps only the features that positively influence the predicted class. Negative contributions (features that suppress the class) are discarded. The resulting map is then upsampled back to input resolution.

Resolution trade-off: The spatial resolution of the Grad-CAM output is limited by the feature map size at the chosen layer — often much coarser than the input. Grad-CAM++ addresses this by using a weighted combination of positive partial derivatives (second-order terms) instead of simple average pooling, giving better localization when multiple instances of the same class appear.


Method 4 — SHAP (SHapley Additive exPlanations)

Core idea: Treat feature attribution as a cooperative game. Each time step is a “player,” the model’s prediction is the “payout,” and we want to fairly divide credit using Shapley values from game theory.

The Shapley value for time step $t$ is:

$$ \phi_t = \sum_{S \subseteq T \setminus \{t\}} \frac{|S|!\,(|T|-|S|-1)!}{|T|!} \left[f(S \cup \{t\}) - f(S)\right] $$

where $f(S)$ is the model’s expected output when only the features in subset $S$ are observed (others are marginalized out). The sum iterates over every possible subset of features that doesn’t include $t$, weighting each by how many orderings of players would produce that coalition.

In plain terms: SHAP asks “how much does adding time step $t$ to an existing set of features change the prediction, on average across all possible orderings?”

The axiomatic foundation: Shapley values are the unique solution satisfying four axioms simultaneously:

  • Efficiency: $\sum_t \phi_t = F(x) - \mathbb{E}[F]$ (completeness over the mean baseline)
  • Symmetry: Features with identical contributions get identical attributions
  • Dummy: Features that never change the output get zero attribution
  • Linearity/Additivity: Attributions decompose linearly across additive models

The computational problem: Exact Shapley values require evaluating $f$ on all $2^{|T|}$ subsets — exponential in the number of features. For a 10-step signal that’s 1,024 evaluations; for a 1,000-step signal it’s astronomically expensive.

GradSHAP (used in the examples below) approximates Shapley values using Integrated Gradients with multiple random baselines sampled from the training data, rather than a single fixed baseline — giving more robust estimates at manageable computational cost.


Method 5 — LIME (Local Interpretable Model-Agnostic Explanations)

Core idea: Fit a simple, interpretable surrogate model (typically linear) in the local neighborhood of the input. The surrogate’s coefficients become the attribution scores.

The process:

  1. Define an interpretable representation — for time series, segment the signal into $P$ superpixels (contiguous chunks)
  2. Sample $N$ perturbed inputs $z^{(i)} \in \{0,1\}^P$ by randomly turning superpixels “on” or “off”
  3. Recover the actual perturbed signal $x^{(i)}$ from the binary mask and run the model: $\hat{y}^{(i)} = F(x^{(i)})$
  4. Weight samples by their proximity to the original: $w^{(i)} = \exp\!\left(-d(x, x^{(i)})^2 / \sigma^2\right)$
  5. Fit a weighted LASSO regression: $\min_{\beta} \sum_i w^{(i)}\left(\hat{y}^{(i)} - z^{(i)\top}\beta\right)^2 + \lambda\|\beta\|_1$
  6. The coefficients $\beta$ are the attribution scores for each superpixel

The instability problem: Because LIME relies on random sampling of perturbations, two runs on the same input can produce meaningfully different explanations. The choice of $\sigma$ (kernel bandwidth), $N$ (number of samples), and the superpixel segmentation all affect the result and have no principled defaults. This makes LIME unreliable for high-stakes use cases without careful tuning and stability checks.
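A compact numpy sketch of the LIME loop on the running example. Everything here is a simplification: the "model" is a made-up function that only reads time steps 2-3 (0-indexed), the proximity kernel is a simple stand-in on mask distance, and weighted ridge replaces the LASSO step so the fit has a closed form. The surrogate should hand essentially all credit to the superpixel covering steps 2-3:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: only steps 2-3 matter, i.e. exactly one superpixel below
F = lambda x: x[2] + x[3]

x = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])
P, seg = 5, 2                               # 5 superpixels of 2 steps each

Z = rng.integers(0, 2, size=(200, P))       # binary on/off masks
y = np.array([F(x * np.repeat(z, seg)) for z in Z])

# Stand-in proximity kernel on the fraction of superpixels switched off
wts = np.exp(-((P - Z.sum(axis=1)) / P) ** 2)

# Weighted ridge regression (closed form) in place of LASSO;
# the sparsity penalty is not essential for this toy case
A = np.c_[Z.astype(float), np.ones(len(Z))]   # add intercept column
lam = 1e-3
G = A.T @ (A * wts[:, None]) + lam * np.eye(P + 1)
beta = np.linalg.solve(G, A.T @ (wts * y))
```

Superpixel 1 (steps 2-3) should receive a coefficient near $x_2 + x_3 = 1.7$ while the others stay near zero.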


Method 6 — Occlusion Sensitivity

Core idea: Directly measure the causal effect of each region by masking it and observing the prediction change.

Slide a zero-mask of width $w$ across the input. At each position $p$, record the drop in prediction confidence:

$$ \phi_p^{\text{occ}} = F(x) - F(x \odot m_p) $$

where $m_p$ is a binary mask that zeros out the window centered at position $p$.

Original: [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2] → F(x) = 0.95

Mask at t=4 (window=1):
  x ⊙ m₄ = [0.1, 0.2, 0.8, 0.0, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2] → F = 0.61
  φ₄ = 0.95 - 0.61 = 0.34  ← large drop, step 4 is important

Mask at t=9:
  x ⊙ m₉ = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.0, 0.2] → F = 0.93
  φ₉ = 0.95 - 0.93 = 0.02  ← negligible drop, step 9 is not important

This is the most direct measurement of feature importance — it’s not approximating any gradient, it’s measuring actual output changes. The cost is compute: you need one forward pass per mask position, making it $O(T)$ times slower than gradient methods.
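The sliding-mask loop is a few lines. A sketch against a made-up logistic scorer (w is hypothetical, not a trained network), where zeroing the heavily weighted step 3 produces the largest confidence drop:

```python
import numpy as np

# Hypothetical scorer that leans heavily on time step 3 (0-indexed)
w = np.array([0.0, 0.1, 1.5, 2.0, 1.0, 0.2, 0.5, 0.3, 0.6, 0.1])
F = lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))

x = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])

def occlusion(F, x, width=1):
    """phi_p = F(x) - F(x with a zero window centred at p)."""
    base = F(x)
    phi = np.zeros(len(x))
    for p in range(len(x)):
        xm = x.copy()
        lo, hi = max(0, p - width // 2), min(len(x), p + width // 2 + 1)
        xm[lo:hi] = 0.0
        phi[p] = base - F(xm)
    return phi

phi = occlusion(F, x)
```

Step 0 has zero weight in this toy model, so masking it changes nothing; step 3 dominates.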


Method Comparison

| Method | Attribution type | Satisfies completeness? | Resolution | Stability | Cost |
| --- | --- | --- | --- | --- | --- |
| Vanilla Gradient | Local gradient | ❌ | Per-feature | ⚠️ Noisy | O(1) backward pass |
| Integrated Gradients | Path integral | ✅ | Per-feature | Stable | O(m) backward passes |
| Grad-CAM | Feature-map gradients | ❌ | Coarse (layer resolution) | Stable | O(1) backward pass |
| SHAP (exact) | Shapley values | ✅ | Per-feature | Stable | O(2ᵀ) evaluations |
| GradSHAP | IG + multiple baselines | ≈✅ | Per-feature | Stable (averaged) | O(m × B) passes |
| LIME | Local linear surrogate | ❌ | Superpixel | ⚠️ Unstable | O(N) forward passes |
| Occlusion | Direct masking | ❌ | Per-patch | Stable | O(T) forward passes |

Part 2: Evaluating Attribution Maps

Getting an attribution map is the easy part. The hard part is knowing whether it’s correct. There are two complementary evaluation paradigms.


Evaluation Paradigm 1: Human–Machine Alignment

If you have domain expert annotations, you can directly compare what the model highlights against what experts say is important. We’ll use the same running example.

Expert annotation — an engineer identifies the nonlinear zone (steps 3–7) as the critical region:

E = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]

Machine relevance scores from our attribution method:

r = [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2]

Metric 1 — Intersection over Union (IoU)

IoU converts continuous scores to binary by thresholding at the top-τ quantile, then computes the Jaccard similarity between the machine’s highlighted set and the expert’s.

Threshold at τ = 0.20 (top 20% = top 2 time steps):

Sorted r:  [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
θ₀.₂₀ = 80th percentile = 0.8

M_0.20:    [0,   0,   1,   1,   0,   0,   0,   0,   0,   0]
                        ↑────↑  steps 3 and 4 score ≥ 0.8
$$ \text{IoU}(E, M_\tau) = \frac{|E \cap M_\tau|}{|E \cup M_\tau|} = \frac{\sum_t \mathbb{1}[E_t=1 \wedge M_{\tau,t}=1]}{\sum_t \mathbb{1}[E_t=1 \vee M_{\tau,t}=1]} $$
E ∩ M = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0]  → sum = 2
E ∪ M = [0, 0, 1, 1, 1, 1, 1, 0, 0, 0]  → sum = 5

IoU = 2/5 = 0.40

IoU ranges from 0 (no overlap) to 1 (perfect overlap). It penalizes both false positives (machine highlights something outside expert zone) and false negatives (machine misses something the expert flagged).

Important nuance: IoU depends on τ. As you increase τ (select more time steps), IoU typically rises until the selection covers the expert zone, then falls again as false positives accumulate. Reporting IoU at a single threshold is therefore less informative than sweeping τ across a range and computing the area under the IoU-τ curve.
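The IoU computation above can be reproduced mechanically (top-$k$ selection by quantile threshold):

```python
import numpy as np

E = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])                   # expert zone, steps 3-7
r = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])

def iou(E, r, tau):
    k = max(1, int(round(tau * len(r))))   # number of top time steps to keep
    thresh = np.sort(r)[-k]                # k-th largest score
    M = (r >= thresh).astype(int)          # binarized machine map
    inter = np.sum((E == 1) & (M == 1))
    union = np.sum((E == 1) | (M == 1))
    return inter / union

iou_020 = iou(E, r, 0.20)   # reproduces the worked example: 2/5 = 0.40
```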


Metric 2 — Mass Ratio

IoU is a hard binary comparison. Mass Ratio is softer: it asks what fraction of the model’s total relevance mass falls inside the expert zone, without thresholding.

$$ \text{Mass}(E) = \frac{\sum_{t:\, E_t=1} r_t^+}{\sum_{t=1}^T r_t^+} $$

where $r_t^+ = \max(r_t, 0)$ takes only the positive relevance (some methods produce signed attributions; negative relevance means the feature suppresses the predicted class).

Computing Mass(E):

Numerator — relevance in expert zone (steps 3–7):

$$ 0.8 + 0.9 + 0.7 + 0.3 + 0.5 = 3.2 $$

Denominator — total positive relevance:

$$ 0.1 + 0.2 + 0.8 + 0.9 + 0.7 + 0.3 + 0.5 + 0.4 + 0.6 + 0.2 = 4.7 $$

$$ \text{Mass}(E) = \frac{3.2}{4.7} \approx 0.681 $$

We can also compute the mass in an unimportant “steady-state” region $S$ (steps 8–10):

$$ \text{Mass}(S) = \frac{0.4 + 0.6 + 0.2}{4.7} = \frac{1.2}{4.7} \approx 0.255 $$

A well-calibrated model should show high Mass(E) and low Mass(S).
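The sums can be checked mechanically (note the total positive relevance of the running example is 4.7):

```python
import numpy as np

r = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])
E = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])   # expert zone, steps 3-7
S = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])   # steady-state zone, steps 8-10

def mass(zone, r):
    rp = np.clip(r, 0, None)                   # positive relevance only
    return rp[zone == 1].sum() / rp.sum()

mass_E = mass(E, r)   # 3.2 / 4.7
mass_S = mass(S, r)   # 1.2 / 4.7
```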

Why IoU and Mass can disagree: Consider a case where the model concentrates all relevance in the expert zone but the zone has 5 time steps while $\tau = 0.10$ selects only 1. Mass(E) = 1.0 (perfect!) but IoU = 0.20 (one step out of five overlapping). The two metrics are measuring different things — Mass is about distributional concentration, IoU is about set-level agreement at a given threshold.


Metric 3 — Spearman Rank Correlation

Spearman correlation measures whether the model’s relevance ranking matches the expert’s ranking — without requiring a threshold decision.

$$ \rho_s = 1 - \frac{6 \sum_{t=1}^{T} d_t^2}{T(T^2 - 1)} $$

where $d_t$ is the difference in ranks between $r_t$ and $E_t$.

Step 1 — Rank the machine scores (rank 1 = highest relevance):

r:      [0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2]
rank(r):[10,  8.5,  2,   1,   3,   7,   5,   6,   4,  8.5]

(Tied values at 0.2 receive the average rank: (8+9)/2 = 8.5)

Step 2 — Rank the expert annotations:

Five zeros occupy ranks 6–10 → average = 8. Five ones occupy ranks 1–5 → average = 3.

E:      [0,   0,   1,   1,   1,   1,   1,   0,   0,   0]
rank(E):[8,   8,   3,   3,   3,   3,   3,   8,   8,   8]

Step 3 — Compute rank differences and apply formula:

Step:    1     2     3     4     5     6     7     8     9    10
rank(r): 10    8.5   2     1     3     7     5     6     4    8.5
rank(E): 8     8     3     3     3     3     3     8     8    8
d:       2     0.5  -1    -2     0     4     2    -2    -4    0.5
d²:      4     0.25  1     4     0    16     4     4    16    0.25
$$ \sum_t d_t^2 = 49.5 $$

$$ \rho_s = 1 - \frac{6 \times 49.5}{10 \times 99} = 1 - \frac{297}{990} = 0.70 $$

A Spearman correlation of 0.70 indicates a strong positive monotonic relationship — when the expert assigns importance, the model tends to assign high relevance, even if the exact scores don’t perfectly align.

Why Spearman over Pearson? Spearman makes no distributional assumption. The expert annotation is binary (a step function), while relevance scores are continuous — Pearson correlation would be distorted by the shape of those distributions. Spearman only cares about ordinal agreement.
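The rank computation (average ranks for ties, rank 1 = highest) and the $d^2$ formula reproduce the worked value of $\rho_s = 0.70$:

```python
import numpy as np

def avg_ranks_desc(v):
    """Rank 1 = highest value; tied values get the average of their ranks."""
    order = np.argsort(-v, kind="stable")
    ranks = np.empty(len(v))
    ranks[order] = np.arange(1, len(v) + 1)
    for val in np.unique(v):
        mask = v == val
        ranks[mask] = ranks[mask].mean()   # average over ties
    return ranks

r = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.5, 0.4, 0.6, 0.2])
E = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0])

d = avg_ranks_desc(r) - avg_ranks_desc(E.astype(float))
T = len(r)
rho = 1 - 6 * np.sum(d**2) / (T * (T**2 - 1))   # sum(d^2) = 49.5, rho = 0.70
```

For heavily tied data the $d^2$ shortcut formula is an approximation; a library routine such as scipy.stats.spearmanr applies the exact tie correction.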


Evaluation Paradigm 2: Corruption-Based (Faithfulness Tests)

Human annotations tell you whether the machine agrees with the expert. But they can’t tell you whether the attributed features are actually causally responsible for the model’s prediction, versus merely correlated with the expert’s view.

The corruption test (also called the faithfulness test) checks this directly: if an attribution map is telling the truth, then destroying the most-attributed time steps should damage the model’s prediction more than destroying the least-attributed ones.


The Score Drop Curve

Protocol:

  1. Rank time steps by relevance score (descending)
  2. For $k \in \{5\%, 10\%, 20\%, \ldots, 100\%\}$, replace the top-$k\%$ time steps with zeros (or noise)
  3. Run the corrupted input through the model; record the prediction confidence drop $\Delta S(k) = F(x) - F(\bar{x}_k)$
  4. Normalize both axes: $\tilde{S}(k) = \Delta S(k) / \max_k \Delta S(k)$ and $\tilde{N}(k) = k$
  5. Plot $\tilde{S}$ vs. $\tilde{N}$

Concrete example:
F(x) = 0.95 (original confidence)

k=10%:  F(x̄) = 0.90, ΔS = 0.05, S̃ = 0.067
k=20%:  F(x̄) = 0.80, ΔS = 0.15, S̃ = 0.200
k=50%:  F(x̄) = 0.50, ΔS = 0.45, S̃ = 0.600
k=100%: F(x̄) = 0.20, ΔS = 0.75, S̃ = 1.000  ← max drop used for normalization

A steep early rise in $\tilde{S}$ means the top-ranked time steps carry most of the predictive signal — the attribution map is faithful. A flat curve means the model doesn’t actually rely on what the method says is important.
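The normalization in the concrete example is a one-liner:

```python
import numpy as np

F_x    = 0.95                                  # original confidence
k      = np.array([0.10, 0.20, 0.50, 1.00])    # fraction of steps corrupted
F_corr = np.array([0.90, 0.80, 0.50, 0.20])    # confidence after corruption

dS = F_x - F_corr               # [0.05, 0.15, 0.45, 0.75]
S_tilde = dS / dS.max()         # normalize by the largest drop
```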


Metric 4 — $\text{AUC}\,\tilde{S}_{\text{top}}$

Summarize the curve as an area:

$$ \text{AUC}\,\tilde{S}_{\text{top}} = \int_0^1 \tilde{S}(\tilde{N}) \, d\tilde{N} $$

High $\text{AUC}\,\tilde{S}_{\text{top}}$ → method correctly identifies causally important time steps.


Metric 5 — F₁ S̃ Score

We can run the symmetric experiment: instead of corrupting the most important steps, corrupt the least important (bottom-$k\%$). A faithful method should produce very small drops here.

We combine both into an F₁-style harmonic mean:

$$ F_1\tilde{S} = \frac{\text{AUC}\,\tilde{S}_{\text{top}} \cdot \left(1 - \text{AUC}\,\tilde{S}_{\text{bottom}}\right)}{\text{AUC}\,\tilde{S}_{\text{top}} + \left(1 - \text{AUC}\,\tilde{S}_{\text{bottom}}\right)} $$

The analogy to precision-recall is deliberate:

  • $\text{AUC}\tilde{S}_{\text{top}}$: When we claim a step is important, does corrupting it hurt the model? (→ precision of the positive claims)
  • $1 - \text{AUC}\tilde{S}_{\text{bottom}}$: When we claim a step is unimportant, does the model survive its corruption? (→ recall of the negative claims)

A high $F_1\tilde{S}$ requires a method to be good at identifying both what matters and what doesn’t.


Metrics 6 & 7 — Robustness via Skewness and Kurtosis

$\text{AUC}\,\tilde{S}_{\text{top}}$ gives us the average behavior across the test set. But a method could have high average performance while being wildly inconsistent across samples. We want to know: is the method robust?

Run the corruption experiment on $L$ different test samples. At each corruption level $k$, you get $L$ normalized score drops $\{\tilde{S}^{(1)}(k), \ldots, \tilde{S}^{(L)}(k)\}$. Analyze this distribution using its higher-order moments.

Skewness — asymmetry of the drop distribution:

$$ \text{skew}(k) = \frac{\kappa_3(k)}{\kappa_2(k)^{3/2}}, \quad \text{where } \kappa_i(k) = \frac{1}{L}\sum_{n=1}^L \left(\tilde{S}^{(n)}(k) - \bar{\tilde{S}}(k)\right)^i $$

A negative skew is desirable: it means the distribution has a long left tail, i.e., most samples exhibit large score drops (the method reliably identifies important features across inputs), with only a few underperforming outliers.

Negative skew (robust):        Positive skew (unreliable):
most samples here ↓                         ↓ most samples here
    ████                                ██
   ██████                              ████
  █████████                          ████████
 ████████████                    ████████████████
──────────────────           ──────────────────────
high drop  low drop          low drop        high drop

Excess Kurtosis — how sharply peaked versus flat the distribution is:

$$ \text{Ekurt}(k) = \frac{\kappa_4(k)}{\kappa_2(k)^2} - 3 $$

Positive excess kurtosis (leptokurtic) is desirable: drops are tightly clustered, the method behaves consistently. Negative kurtosis (platykurtic) means a wide, flat distribution — unpredictable method behavior.

We summarize robustness across all corruption levels by computing areas:

$$ \overline{\text{AUC}}_{\text{skew}} = 1 - \int_{0.05}^{1} \text{skew}(k) \, dk, \qquad \text{AUC}_{\text{kurt}} = \int_{0.05}^{1} \text{Ekurt}(k) \, dk $$

We want both to be large: consistently negative skew and consistently positive kurtosis across the full range of $k$.
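These moment statistics are easy to compute directly from the per-sample drops at a single corruption level; here is a sketch (names are illustrative, not from any library API):

```python
import numpy as np

def skew_and_ekurt(drops):
    """Central-moment skewness and excess kurtosis of per-sample
    normalized score drops at one corruption level k."""
    d = np.asarray(drops, dtype=float)
    c = d - d.mean()
    k2 = np.mean(c**2)                     # second central moment
    skew = np.mean(c**3) / k2**1.5
    ekurt = np.mean(c**4) / k2**2 - 3.0    # excess relative to a Gaussian
    return skew, ekurt

# Most samples drop a lot (near 1), one outlier drops little -> long left tail
drops = [0.95, 0.9, 0.92, 0.97, 0.93, 0.4]
s, k = skew_and_ekurt(drops)
print(s < 0, k > 0)  # True True: the desirable "robust" signature
```

Sweeping this over corruption levels $k$ and integrating, as in the formulas above, yields the two robustness AUCs.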


Why Can IoU Be Low While Mass(E) Is High?

This is worth pausing on because it’s a subtlety that comes up a lot in practice.

Suppose all of the model’s relevance is concentrated in the expert zone:

E = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   (steps 1–5 are expert zone)
r = [5, 4, 3, 2, 1, 0, 0, 0, 0, 0]   (all mass in E, none outside)

Mass(E) = 15/15 = 1.00 — perfect. The model ignores everything outside the expert zone.

But at τ = 0.10 (top 10% = top 1 step), M_τ = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]:

$$ \text{IoU} = \frac{1}{5} = 0.20 $$

The model is perfectly concentrated where the expert says, but IoU is only 0.20 because the expert zone has 5 time steps and we’re only selecting 1. IoU penalizes the model for not covering the entire expert zone — even though all its attention is in the right place.

This reveals what each metric is actually measuring: Mass captures distributional concentration of relevance, while IoU captures set-level coverage at a specific threshold. Both matter, and they measure different failure modes.
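The worked example above can be checked numerically; this sketch (with hypothetical helper names) reproduces both numbers:

```python
import numpy as np

def mass_ratio(r, zone):
    """Fraction of total relevance that falls inside a binary zone mask."""
    r = np.asarray(r, dtype=float)
    return r[np.asarray(zone, dtype=bool)].sum() / r.sum()

def iou_at_tau(r, zone, tau):
    """IoU between the expert zone and the top-tau fraction of steps."""
    r = np.asarray(r, dtype=float)
    zone = np.asarray(zone, dtype=bool)
    k = max(1, int(round(tau * len(r))))   # number of steps selected
    top = np.zeros(len(r), dtype=bool)
    top[np.argsort(r)[::-1][:k]] = True    # mask of the k highest-relevance steps
    return np.sum(top & zone) / np.sum(top | zone)

E = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
r = np.array([5, 4, 3, 2, 1, 0, 0, 0, 0, 0])
print(mass_ratio(r, E))        # 1.0
print(iou_at_tau(r, E, 0.10))  # 0.2
```

Perfect Mass, low IoU: the union in the IoU denominator still contains all five expert steps, even though only one was selected.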


Putting It All Together

Getting an attribution map:

| Method | Core mechanism | Satisfies completeness? | Key limitation |
| --- | --- | --- | --- |
| Vanilla Gradient | Local gradient $\partial F / \partial x_t$ | No | Gradient saturation |
| Integrated Gradients | Path integral of gradients | Yes | Baseline choice matters |
| Grad-CAM | Gradient-weighted feature maps | No | Coarse spatial resolution |
| SHAP | Shapley values (game theory) | Yes (efficiency axiom) | Exponential compute (approximated) |
| LIME | Local linear surrogate | No | Unstable across runs |
| Occlusion | Direct masking + forward pass | No | Slow ($O(T)$ passes) |
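Of the methods above, occlusion is the simplest to implement from scratch. A minimal sketch for a 1-D time series (the toy `model` and all names here are invented for illustration):

```python
import numpy as np

def occlusion_attribution(model_fn, x, baseline=0.0):
    """Per-time-step occlusion: replace each step with a baseline value
    and record the resulting score drop. Costs O(T) forward passes."""
    base_score = model_fn(x)
    attr = np.zeros(len(x))
    for t in range(len(x)):
        x_occ = x.copy()
        x_occ[t] = baseline           # mask one step
        attr[t] = base_score - model_fn(x_occ)
    return attr

# Toy "model": scores the mean of the first half of the signal,
# so only steps 0-2 should receive nonzero attribution.
model = lambda x: float(np.mean(x[: len(x) // 2]))
x = np.array([4.0, 2.0, 6.0, 0.0, 0.0, 0.0])
print(occlusion_attribution(model, x))  # last three entries are 0
```

The choice of `baseline` is the same modeling decision that haunts Integrated Gradients: "zero" is only a neutral input if zero actually means "absent" for your signal.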

Evaluating the attribution map:

Human-alignment (external validation):

| Metric | What it measures | Range | Want |
| --- | --- | --- | --- |
| IoU | Set overlap at threshold $\tau$ | $[0, 1]$ | High |
| Mass(E) | Fraction of relevance in expert zone | $[0, 1]$ | High |
| Mass(S) | Fraction of relevance in steady-state | $[0, 1]$ | Low |
| Spearman $\rho$ | Rank-order agreement with expert | $[-1, 1]$ | High |

Corruption-based (internal/faithfulness validation):

| Metric | What it measures | Want |
| --- | --- | --- |
| $\text{AUC}\,\tilde{S}_{\text{top}}$ | Score drop curve under top-$k$ corruption | High |
| $\text{AUC}\,\tilde{S}_{\text{bottom}}$ | Score drop curve under bottom-$k$ corruption | Low |
| $F_1\tilde{S}$ | Harmonic balance of top and bottom | High |
| $\overline{\text{AUC}}_{\text{skew}}$ | Skewness of drops across samples, swept over $k$ | High (negative skew) |
| $\text{AUC}_{\text{kurt}}$ | Kurtosis of drops across samples, swept over $k$ | High (tight distribution) |

Why This Matters

Attribution maps are not ground truth — they’re hypotheses about model behavior. A map that looks visually sensible might fail the corruption test. A map that scores well on IoU against human annotations might be exploiting spurious correlations that happen to align with expert intuitions. The evaluation framework we’ve covered is what lets you stress-test those hypotheses rigorously.

The bigger picture is this: we want AI systems that are not just accurate but auditable. Attribution methods are a first step toward that. They let you check whether a model is relying on the right features, catch models that cheat (e.g., a radiology model that learned to read scanner manufacturer metadata rather than pathology), and build the kind of trust that comes from understanding, not just performance.

We’re not there yet. Methods disagree. They can be fooled by adversarial perturbations. Evaluation benchmarks have their own limitations. But the mathematical toolkit we’ve covered — Integrated Gradients, Shapley values, IoU, Mass Ratio, Spearman correlation, and corruption-based faithfulness tests — gives us something to work with, and a principled way to hold attribution methods accountable.


Further reading: Simonyan et al. (2014) — vanilla gradients; Sundararajan et al. (2017) — Integrated Gradients and the axiomatic framework; Selvaraju et al. (2017) — Grad-CAM; Lundberg & Lee (2017) — SHAP; Ribeiro et al. (2016) — LIME; Adebayo et al. (2018) — sanity checks for saliency maps (essential reading before trusting any method).