ICML 2026  ·  accepted (regular)

DP‑KFC
Data‑Free Preconditioning for
Privacy‑Preserving Deep Learning

Marc Molina Van den Bosch1,2, Riccardo Taiello1, Albert Sund Aillet1, Andrea Protani1,2, Miguel Angel Gonzalez Ballester2,3, Luigi Serio1

1CERN, Geneva    2BCN MedTech, Universitat Pompeu Fabra    3ICREA

TL;DR. DP‑SGD injects isotropic noise into an anisotropic loss landscape. We fix the mismatch with a KFAC preconditioner, but instead of estimating curvature from private data (costs privacy budget) or public data (causes distribution shift), we probe the network with structured synthetic noise and recover it for free.

The problem

An isotropy mismatch

Deep networks have wildly ill-conditioned loss landscapes: curvature eigenvalues span orders of magnitude across directions. DP‑SGD, on the other hand, enforces a sphere: global \(L_2\) clipping and isotropic Gaussian noise treat every direction the same.

The result: low-sensitivity parameters drown in noise while high-sensitivity directions get over-clipped. The resulting heterogeneous signal-to-noise ratios slow convergence and leave deep models barely beating linear baselines. Second-order preconditioning could rotate and rescale gradients so the sphere finally fits the geometry, but estimating curvature under DP has always cost either privacy budget or a public proxy you may not have (think medical imaging).
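The mismatch can be illustrated with a toy sketch. The gradient scales, clip norm and noise multiplier below are assumed values chosen for illustration, not settings from the paper; the point is only that one global clip plus isotropic noise yields very different per-coordinate signal-to-noise ratios.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: per-sample gradients whose coordinate scales
# span three orders of magnitude (an anisotropic landscape).
batch, dim = 256, 4
scales = np.array([1.0, 1e-1, 1e-2, 1e-3])   # assumed per-coordinate gradient scale
grads = scales * (1.0 + 0.1 * rng.standard_normal((batch, dim)))

C, sigma = 1.0, 1.0                           # assumed clip norm and noise multiplier

# Global L2 clipping: each per-sample gradient is rescaled to norm <= C.
norms = np.linalg.norm(grads, axis=1, keepdims=True)
clipped = grads * np.minimum(1.0, C / norms)

# Isotropic Gaussian noise has the same standard deviation in every coordinate.
signal = np.abs(clipped.sum(axis=0))          # per-coordinate signal in the clipped sum
noise_std = sigma * C                         # per-coordinate noise standard deviation
snr = signal / noise_std

print(snr)  # the SNR falls by roughly the same factor as the gradient scale
```

In this sketch the smallest-scale coordinate ends up with an SNR about three orders of magnitude below the largest, which is the heterogeneity the text describes.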

Fig. 1 Layer-wise SNR on a CNN (left) and CrossViT (right), CIFAR‑100. DP‑SGD suffers signal collapse; DP‑KFC reaches a flat, optimal profile, matching data-dependent proxies without any external data.
Key idea

Curvature is mostly architecture, not data

The KFAC Fisher block of a layer factorizes into an activation covariance and a gradient covariance:

\( F_\ell \;\approx\; \underbrace{A_{\ell-1}}_{\text{input correlations}} \;\otimes\; \underbrace{G_\ell}_{\text{architectural sensitivity}} \)
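A minimal numerical sketch of this factorization for a single linear layer follows. It assumes Gaussian activations and error signals that are independent of each other, which is the idealized case where the Kronecker approximation becomes exact in expectation; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 50_000, 3, 2
a = rng.standard_normal((n, d_in))        # layer inputs a_{l-1} (assumed independent of delta)
delta = rng.standard_normal((n, d_out))   # backpropagated error signals delta_l

# Single sample: the Kronecker identity is exact.
g0 = np.outer(delta[0], a[0])             # per-sample weight gradient, shape (d_out, d_in)
v0 = g0.flatten(order="F")                # column-major vec(g0), equal to kron(a, delta)
exact_single = np.allclose(
    np.outer(v0, v0),
    np.kron(np.outer(a[0], a[0]), np.outer(delta[0], delta[0])),
)

# Batch: empirical Fisher block vs. Kronecker product of the averaged factors.
V = np.einsum("ni,nj->nij", a, delta).reshape(n, -1)  # row k is vec of sample k's gradient
F = V.T @ V / n
A = a.T @ a / n                           # activation covariance factor
G = delta.T @ delta / n                   # gradient covariance factor
rel_err = np.linalg.norm(F - np.kron(A, G)) / np.linalg.norm(F)

print(exact_single, rel_err)
```

The practical benefit is storage: the two factors have `d_in**2 + d_out**2` entries, whereas the full block has `(d_in * d_out)**2`.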

Mean-field signal-propagation theory says the scale of these factors (and of the resulting preconditioner) is fixed by the architecture (widths, init variances, nonlinearities) and contracts to an architectural fixed point regardless of the inputs. So we can recover it by feeding the network unstructured noise with random labels. The only data-dependent piece is the eigenvector directions of \(A_{\ell-1}\), i.e. the correlation structure of the inputs, and natural data has a universal one: images follow \(1/f^{\alpha}\) power spectra, text follows heavy-tailed token statistics. We bake those priors into the probes (pink noise for vision, structural token sequences for language).

So: generate synthetic probes → forward/backward pass through the current model → assemble KFAC factors → invert. Zero touches of the private dataset ⇒ zero privacy cost, and no domain bias to inherit.
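One way to generate the vision probes is to shape white noise in the Fourier domain. The sketch below is an assumption about how such a generator could look: the function name `pink_noise_images`, the default `alpha=2.0`, the per-image normalization and the 10-class label range are all hypothetical choices, not details taken from the paper.

```python
import numpy as np

def pink_noise_images(n, size, alpha=2.0, seed=0):
    """Sample n grayscale size x size images with power spectrum ~ 1/f^alpha."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal((n, size, size))

    # Radial spatial frequency of every FFT bin.
    fy = np.fft.fftfreq(size)[:, None]
    fx = np.fft.fftfreq(size)[None, :]
    f = np.sqrt(fx**2 + fy**2)
    f[0, 0] = 1.0                       # placeholder to avoid dividing by zero at DC

    # Power ~ 1/f^alpha means amplitude ~ f^(-alpha/2).
    filt = f ** (-alpha / 2.0)
    filt[0, 0] = 0.0                    # drop the DC component so each image has zero mean

    imgs = np.fft.ifft2(np.fft.fft2(white) * filt).real
    imgs /= imgs.std(axis=(1, 2), keepdims=True)   # unit variance per image
    return imgs

probes = pink_noise_images(n=64, size=32)
labels = np.random.default_rng(1).integers(0, 10, size=64)   # random labels
```

Because the filter is real and symmetric in frequency, the inverse transform of the filtered white noise is real up to rounding, so taking `.real` discards only numerical residue.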
Fig. 2 Sorted KFAC eigenvalues for (a) an MLP layer, (b) a CNN conv layer, (c) an attention QKV projection. Synthetic noise (orange) recovers the same architecture-determined decay as the private oracle (black, dashed); domain-matched public data (FashionMNIST) tracks it too, while domain-mismatched data (CIFAR‑10, purple) drifts.
Method

DP‑KFC, in three moves

1. Probe with synthetic noise

Build a batch of modality-shaped noise (pink noise / structural tokens) with random labels. Forward + backward through the current parameters \(\theta_t\) to get activations \(\tilde a_{\ell-1}\) and error signals \(\tilde\delta_\ell\).
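The probing pass can be sketched on a toy model. Everything below is illustrative: the two-layer tanh MLP, the helper name `probe_layers`, and the plain Gaussian inputs standing in for modality-shaped noise are assumptions, not the paper's setup.

```python
import numpy as np

def probe_layers(W1, W2, x, labels):
    """Forward/backward pass of a two-layer tanh MLP with softmax cross-entropy.

    Returns the inputs to each layer and the error signals, i.e. the gradient
    of the summed loss with respect to each layer's pre-activations.
    """
    h = np.tanh(x @ W1.T)                          # hidden activations, (n, hidden)
    logits = h @ W2.T                              # (n, classes)
    logits = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)              # softmax probabilities
    onehot = np.eye(W2.shape[0])[labels]
    delta2 = p - onehot                            # error signal at the output layer
    delta1 = (delta2 @ W2) * (1.0 - h**2)          # backpropagated through tanh
    return [x, h], [delta1, delta2]

rng = np.random.default_rng(0)
n, d_in, d_hidden, n_classes = 32, 8, 16, 4
W1 = rng.standard_normal((d_hidden, d_in)) / np.sqrt(d_in)
W2 = rng.standard_normal((n_classes, d_hidden)) / np.sqrt(d_hidden)
x_probe = rng.standard_normal((n, d_in))           # stand-in for modality-shaped noise
y_probe = rng.integers(0, n_classes, size=n)       # random labels
acts, deltas = probe_layers(W1, W2, x_probe, y_probe)
```

Here `acts[k]` and `deltas[k]` play the roles of \(\tilde a_{\ell-1}\) and \(\tilde\delta_\ell\) for layer `k`; no private example enters the computation.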

2. Build the preconditioner

Aggregate Kronecker factors \(\hat A_{\ell-1},\hat G_\ell\), eigendecompose, and form the regularized inverse square roots \(U_{A,\ell},U_{G,\ell}\). The implicit preconditioner is \(\tilde F^{-1/2}=U_A\otimes U_G\), never materialized in full.
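A minimal sketch of this step, assuming simple additive damping as the regularizer (the helper name `inv_sqrt` and the damping value are hypothetical). It also checks that applying the two small factors to the gradient matrix is equivalent to applying the full Kronecker product to the column-major vectorized gradient, which is why the full matrix never needs to be formed.

```python
import numpy as np

def inv_sqrt(factor, damping=1e-3):
    """Regularized inverse square root (factor + damping * I)^(-1/2)."""
    eigvals, eigvecs = np.linalg.eigh(factor)
    eigvals = np.maximum(eigvals, 0.0) + damping   # guard against tiny negative eigenvalues
    return (eigvecs / np.sqrt(eigvals)) @ eigvecs.T

rng = np.random.default_rng(0)
n, d_in, d_out = 64, 5, 3
a = rng.standard_normal((n, d_in))          # probe activations (toy stand-in)
delta = rng.standard_normal((n, d_out))     # probe error signals (toy stand-in)

A_hat = a.T @ a / n                         # activation factor, (d_in, d_in)
G_hat = delta.T @ delta / n                 # gradient factor, (d_out, d_out)
U_A, U_G = inv_sqrt(A_hat), inv_sqrt(G_hat)

g = rng.standard_normal((d_out, d_in))      # a weight gradient to precondition
g_tilde = U_G @ g @ U_A                     # factored form, cheap

# Equivalent full form: (U_A kron U_G) applied to column-major vec(g).
full = np.kron(U_A, U_G) @ g.flatten(order="F")
g_tilde_full = full.reshape(d_out, d_in, order="F")
```

The factored product costs two small matrix multiplications, while the materialized Kronecker matrix has `(d_in * d_out)**2` entries.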

3. Scale, then privatize

Transform each per-sample gradient as \(\tilde g_\ell = U_{G,\ell}\, g_\ell\, U_{A,\ell}\) before clipping and adding noise. In the preconditioned space \(\mathrm{Cov}(\tilde g)\!\approx\! I\): the fixed clip threshold applies uniformly and the DP noise is never amplified by the preconditioner.
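The ordering of operations can be sketched for one layer as follows. The function name `scale_then_privatize`, the diagonal placeholder factors and the batch-mean normalization are assumptions made for illustration; how the returned quantity is turned into a parameter update is not covered by this sketch.

```python
import numpy as np

def scale_then_privatize(per_sample_grads, U_G, U_A, clip_norm, noise_multiplier, rng):
    """Precondition each per-sample gradient, then clip, sum, and add Gaussian noise.

    per_sample_grads has shape (batch, d_out, d_in).
    """
    g_tilde = U_G @ per_sample_grads @ U_A                 # broadcasts over the batch axis
    norms = np.linalg.norm(g_tilde.reshape(len(g_tilde), -1), axis=1)
    factors = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = g_tilde * factors[:, None, None]             # each sample now has norm <= clip_norm
    noise = noise_multiplier * clip_norm * rng.standard_normal(clipped.shape[1:])
    return (clipped.sum(axis=0) + noise) / len(clipped)

rng = np.random.default_rng(0)
batch, d_out, d_in = 16, 3, 5
grads = rng.standard_normal((batch, d_out, d_in))
U_G = np.diag([1.0, 0.5, 2.0])                 # placeholder preconditioner factors
U_A = np.diag([1.0, 2.0, 0.5, 1.0, 3.0])
private_grad = scale_then_privatize(grads, U_G, U_A, clip_norm=1.0,
                                    noise_multiplier=1.0, rng=rng)
```

The noise is drawn after the transform and is never multiplied by `U_G` or `U_A`, which is the property the step relies on.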

Privacy: free, by construction

The preconditioner depends only on the architecture and the synthetic batch, both public, and on \(\theta_t\), which depends only on past privatized releases. So the \(L_2\) sensitivity of the clipped, preconditioned gradient sum is still \(C\): DP‑KFC inherits the exact RDP accounting of plain DP‑SGD. It behaves like implicit adaptive clipping, but the adaptation comes entirely from non-sensitive architectural facts.
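The sensitivity claim can be spelled out in one line. As a sketch, assume add/remove adjacency (datasets \(D\) and \(D'\) differ by a single sample \(j\)) and write \(\mathrm{clip}_C(v) = v \cdot \min(1,\, C/\|v\|_2)\); this notation is introduced here for illustration. Since \(U_{A,\ell}\) and \(U_{G,\ell}\) are the same for both datasets, every term except sample \(j\) cancels:

\( \Big\| \sum_{i \in D} \mathrm{clip}_C(\tilde g_i) \;-\; \sum_{i \in D'} \mathrm{clip}_C(\tilde g_i) \Big\|_2 \;=\; \big\| \mathrm{clip}_C(\tilde g_j) \big\|_2 \;\le\; C \)

The bound holds for any fixed preconditioner, however badly scaled, because clipping is applied after the transform.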

Convergence

For non-convex objectives, DP‑KFC converges at the optimal \(\mathcal{O}(T^{-1/2})\) rate, with the privacy term \(d\sigma^2C^2/B^2\) not multiplied by the preconditioner's \(\lambda_{\max}^2\) (the failure mode of post-noise preconditioning). And because synthetic and private curvature share their eigenspectrum up to scale (Spectral Scaling Invariance), preconditioning collapses the effective condition number toward 1.
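The difference between the two orderings can be checked numerically. The sketch below assumes a diagonal preconditioner with an arbitrary toy spectrum and compares the expected squared norm of the batch-averaged noise when the preconditioner is, or is not, applied after the noise; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, B, C, sigma = 8, 64, 1.0, 1.0            # assumed toy values
eigs = np.logspace(0, 3, d)                 # assumed preconditioner eigenvalues, lambda_max = 1e3
P = np.diag(eigs)

# Exact second moments of the averaged noise z / B, with z ~ N(0, sigma^2 C^2 I_d).
pre_noise_term = d * sigma**2 * C**2 / B**2                  # noise added in preconditioned space
post_noise_term = sigma**2 * C**2 * np.sum(eigs**2) / B**2   # preconditioner applied after the noise
upper_bound = d * sigma**2 * C**2 * eigs.max()**2 / B**2     # the lambda_max^2 worst case

# Monte Carlo check of both expressions.
z = sigma * C * rng.standard_normal((20_000, d)) / B
mc_pre = np.mean(np.sum(z**2, axis=1))
mc_post = np.mean(np.sum((z @ P)**2, axis=1))

print(pre_noise_term, post_noise_term, upper_bound)
```

With this toy spectrum the post-noise term is several orders of magnitude larger than \(d\sigma^2C^2/B^2\), while the pre-noise ordering leaves the term untouched.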

Results

Better SNR ⇒ better accuracy, across modalities

DP‑KFC with synthetic preconditioning consistently beats DP‑SGD and adaptive baselines, and matches public-data preconditioning on vision, at zero privacy cost.

Fig. 4 Vision. Test accuracy vs. \(\epsilon\) on MNIST (CNN) and CIFAR‑100 (CrossViT). Synthetic DP‑KFC tracks Public DP‑KFC and clears DP‑SGD across all budgets.
Fig. 5 Language. StackOverflow (BERT) and IMDB (logistic regression). DP‑KFC improves over DP‑Adam everywhere; a residual gap to public proxies remains on StackOverflow (see limitations).

Robust to negative transfer

When a public proxy is mismatched, public-data preconditioning degrades; synthetic noise doesn't, because it carries task-agnostic geometry only. Under a texture-disjoint shift (PathMNIST ← MNIST), synthetic DP‑KFC matches the private oracle while public DP‑KFC drops 4.8 points.

Standalone & complementary

DP‑KFC instantiates the scale-then-privatize principle (acting before the noise). It beats the prior STP baseline (AdaDPS) and post-privatization methods (DP‑AdamBC, DiSK) among data-free methods, and the two families stack: DP‑KFC + DP‑AdamBC is the strongest configuration overall.

MNIST (CNN), test accuracy %, 5 seeds, hyperparameters tuned per method. Best per column in bold.
Method                                ε = 1        ε = 2        ε = 5        ε = 8
DP-SGD                                91.7 ±0.2    92.5 ±0.3    93.4 ±0.3    93.7 ±0.3
AdaDPS (STP, public)                  91.3 ±0.8    93.2 ±1.0    93.6 ±1.3    93.3 ±1.4
DiSK (post-priv.)                     93.7 ±0.4    94.1 ±0.3    94.3 ±0.2    94.3 ±0.2
DP-AdamBC (post-priv.)                94.0 ±0.3    94.8 ±0.2    95.2 ±0.2    95.3 ±0.1
Public DP-KFC                         95.3 ±0.4    95.7 ±0.3    96.2 ±0.3    96.4 ±0.3
Synthetic DP-KFC (ours, data-free)    94.2 ±0.5    95.0 ±0.4    95.7 ±0.3    95.9 ±0.3
Synthetic DP-KFC + DP-AdamBC          95.5 ±0.3    96.1 ±0.2    96.4 ±0.3    96.4 ±0.3
Why it works

Synthetic curvature is provably aligned

Mean-field variance contraction makes the eigenspectra of private and synthetic KFAC factors proportional in deep networks (Spectral Scaling Invariance). Preconditioning with \(\tilde F^{-1/2}\) therefore collapses the effective curvature \(\tilde F^{-1/2} F\, \tilde F^{-1/2}\) toward an approximately scalar multiple of the identity; the leftover constant is simply absorbed by the clip norm and learning rate. We verify this empirically: synthetic and oracle spectra stay parallel across 8+ orders of magnitude, on MLP, CNN and attention layers.

  • Scale alignment (what the clip norm depends on) stays accurate at every depth.
  • Direction alignment is excellent in early layers but degrades in deep layers, where eigenvectors become label-dependent, a limitation for very deep nets.
  • Language gap. Random token probes hit the right architectural factors but land off the low-dimensional manifold of real text. Closing this gap (token-frequency priors, embedding-space probes) is open work.
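The collapse of the effective curvature can be checked in the idealized case. The sketch below assumes the synthetic and private matrices share eigenvectors exactly and differ by a single unknown constant `c`, which is stronger than what the bullets above report for deep layers; the spectrum and the value of `c` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # shared eigenvectors (idealized assumption)
spectrum = np.logspace(0, -6, d)                   # private eigenvalues over six orders of magnitude
c = 3.7                                            # unknown proportionality constant (assumed)

F = (Q * spectrum) @ Q.T                           # private curvature
F_syn = (Q * (c * spectrum)) @ Q.T                 # synthetic curvature, proportional spectrum

# Inverse square root of the synthetic curvature.
w, V = np.linalg.eigh(F_syn)
P = (V / np.sqrt(w)) @ V.T

F_eff = P @ F @ P                                  # effective curvature after preconditioning
cond_before = np.linalg.cond(F)
cond_after = np.linalg.cond(F_eff)

print(cond_before, cond_after)
```

In this setting `F_eff` equals the identity divided by `c`, so the condition number drops from about \(10^6\) to about 1, and the constant `1/c` is the part a clip norm or learning rate can absorb.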
Fig. 3 Tracking the private oracle through training (MNIST CNN, \(\epsilon{=}1\)): (a) cosine similarity (direction), (b) relative Frobenius error (scale). Top: first conv layer. Bottom: deepest layer. Synthetic noise (orange) holds direction in early layers and scale at every depth, beating mismatched public data (CIFAR‑10, purple); deep-layer directions become label-dependent for every method.
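The two metrics in the figure can be sketched as follows. The exact definitions used for the plots are not stated here, so the versions below (cosine under the Frobenius inner product, and Frobenius error relative to the oracle) are assumptions, as are the function names.

```python
import numpy as np

def cosine_similarity(X, Y):
    """Cosine of the angle between two matrices under the Frobenius inner product."""
    return float(np.sum(X * Y) / (np.linalg.norm(X) * np.linalg.norm(Y)))

def relative_frobenius_error(X_hat, X):
    """Frobenius norm of the estimation error, relative to the reference X."""
    return float(np.linalg.norm(X_hat - X) / np.linalg.norm(X))

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
oracle = M @ M.T                      # toy stand-in for an oracle KFAC factor
rescaled = 2.0 * oracle               # same direction, wrong scale

direction = cosine_similarity(rescaled, oracle)
scale_err = relative_frobenius_error(rescaled, oracle)
```

The toy pair shows why both metrics are reported: a pure rescaling leaves the cosine at 1 while the relative Frobenius error is 1, so the first isolates direction and the second is sensitive to scale.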
Cite

BibTeX

@inproceedings{molina2026dpkfc,
  title     = {{DP-KFC}: Data-Free Preconditioning for Privacy-Preserving Deep Learning},
  author    = {Molina Van den Bosch, Marc and Taiello, Riccardo and
               Sund Aillet, Albert and Protani, Andrea and
               Gonzalez Ballester, Miguel Angel and Serio, Luigi},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}