Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models

Runqian Wang¹, Yilun Du²
¹MIT, ²Harvard
TL;DR: Equilibrium Matching (EqM) exceeds Flow Matching in generation quality, supports optimization-based sampling, and solves downstream tasks naturally.

Conceptual 2D Visualization. We compare the conceptual 2D dynamics of Equilibrium Matching and Flow Matching under two ground truths (marked by stars). EqM learns an invariant gradient that always converges to the ground truths, whereas FM learns a time-conditional velocity that only converges to the ground truths at t=1. We also compare the real sampling processes of the two methods. Under identical step sizes and numbers of steps, EqM converges much faster than FM.


Abstract

We introduce Equilibrium Matching (EqM), a generative modeling framework built from an equilibrium dynamics perspective. EqM discards the non-equilibrium, time-conditional dynamics in traditional diffusion and flow-based generative models and instead learns the equilibrium gradient of an implicit energy landscape. Through this approach, we can adopt an optimization-based sampling process at inference time, where samples are obtained by gradient descent on the learned landscape with adjustable step sizes, adaptive optimizers, and adaptive compute. EqM surpasses the generation performance of diffusion/flow models empirically, achieving an FID of 1.90 on ImageNet 256x256. We also show theoretically that EqM learns and samples from the data manifold. Beyond generation, EqM is a flexible framework that naturally handles tasks including partially noised image denoising, OOD detection, and image composition. By replacing time-conditional velocities with a unified equilibrium landscape, EqM offers a tighter bridge between flow and energy-based models and a simple route to optimization-driven inference.


Equilibrium Matching

Equilibrium Matching (EqM) learns a time-invariant gradient field that is compatible with an underlying energy function, eliminating time/noise conditioning and fixed-horizon integrators. Conceptually, EqM’s gradient vanishes on the data manifold and increases toward noise, yielding an equilibrium landscape in which ground-truth samples are stationary points. Flow Matching learns a varying velocity that only converges to ground truths at the final timestep, whereas EqM learns a time-invariant gradient landscape that always converges to ground-truth data points.

To train an Equilibrium Matching model, we aim to construct an energy landscape in which the target gradient at ground-truth samples is zero. To do so, we first define a corruption scheme that provides a transition between data and noise. Our training objective matches a target gradient at these intermediate samples, constructing an implicit energy landscape whose gradient-descent direction points from noise toward data.

Because of its equilibrium nature, Equilibrium Matching generates samples via optimization on the learned landscape. In contrast to diffusion/flow models that integrate over a fixed time horizon, EqM decouples sample quality from a prescribed trajectory. It formulates the sampling process as a gradient descent procedure and supports adaptive step sizes, optimizers, and compute, offering additional flexibility at inference time.


Pseudocode for EqM training and sampling.
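As a minimal PyTorch-style sketch of one possible instantiation: the linear corruption, the target gradient gamma * (eps - x) (chosen so that it vanishes on clean data), the unconditional model interface, and all hyperparameters below are illustrative assumptions rather than the paper's exact choices.

import torch

def eqm_training_step(model, x, optimizer):
    """One training step: regress the model output onto a target equilibrium gradient."""
    eps = torch.randn_like(x)                                  # Gaussian noise
    gamma = torch.rand(x.shape[0], 1, 1, 1, device=x.device)   # corruption level in [0, 1)
    x_gamma = (1 - gamma) * x + gamma * eps                    # interpolate data and noise
    target = gamma * (eps - x)                                 # zero at gamma = 0 (clean data)
    loss = ((model(x_gamma) - target) ** 2).mean()             # no time/noise conditioning
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def eqm_sample(model, shape, steps=250, eta=0.01, device="cuda"):
    """Sampling as plain gradient descent on the learned landscape."""
    x = torch.randn(shape, device=device)   # start from pure noise
    for _ in range(steps):
        x = x - eta * model(x)              # descend the learned gradient
    return x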


Generation Performance

Equilibrium Matching is theoretically guaranteed to learn the data manifold and produce samples from this manifold using gradient descent. Empirically, Equilibrium Matching achieves 1.90 FID on ImageNet 256x256 generation, outperforming existing diffusion and flow-based counterparts in generation quality. Equilibrium Matching also exhibits strong scaling behavior, exceeding the flow-based counterpart at all tested scales.



We present the generation process of our EqM-XL/2 model.



Class-Conditional ImageNet 256x256 Generation. EqM-XL/2 achieves a 1.90 FID, surpassing other tested methods.


To assess scalability, we vary training length, model size, and patch size. EqM scales well along all axes and consistently outperforms Flow Matching under all tested configurations. These results suggest that EqM has strong scaling potential and is a promising alternative to Flow Matching.

EqM scales across training epochs (left), parameter count (middle), and patch size (right), and outperforms Flow Matching at all tested scales by a significant margin.


Optimization-Based Sampling

EqM offers promising opportunities at inference time. Building on top of gradient descent (GD), we can adopt existing optimization techniques in our sampling procedure. As an example, we use Nesterov Accelerated Gradient (NAG), which applies a look-ahead step at each update and evaluates the gradient at that look-ahead point; we call the resulting sampler NAG-GD. In the results below, we observe that NAG-GD consistently improves upon vanilla GD, and the quality gap widens as the total number of steps decreases. This aligns with our intuition that with fewer total steps, gradient descent requires more assistance to reach a desirable local minimum, making NAG more effective.
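A minimal sketch of how Nesterov-style acceleration can be plugged into the gradient-descent sampler is shown below. The function name, momentum coefficient, step size, and step count are illustrative assumptions, not the paper's exact NAG-GD settings.

import torch

@torch.no_grad()
def eqm_sample_nag(model, shape, steps=50, eta=0.01, mu=0.9, device="cuda"):
    """Gradient-descent sampling with a Nesterov-style look-ahead (NAG-GD)."""
    x = torch.randn(shape, device=device)   # start from pure noise
    v = torch.zeros_like(x)                 # momentum buffer
    for _ in range(steps):
        g = model(x + mu * v)               # evaluate the gradient at the look-ahead point
        v = mu * v - eta * g                # update momentum with the look-ahead gradient
        x = x + v                           # take the accelerated descent step
    return x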


Sampling with Nesterov Accelerated Gradient. NAG-GD achieves better sample quality than GD, with the gap being larger when using fewer steps.


Viewing sampling from an optimization perspective means the step size can be adjusted freely. For comparison, we also report the performance of the FM baseline, where the step size replaces the ODE update length at each step. We observe that our EqM model's generation quality remains high across all tested step sizes, whereas FM requires a specific step size to function properly and small fluctuations lead to significantly worse performance. This suggests that EqM constructs a fundamentally different landscape than FM, one that enables new sampling schemes not supported by FM models.


Different Sampling Step Sizes. EqM is robust to a wide range of step sizes, whereas Flow Matching only functions properly at one specific step size.



Another advantage of gradient-based sampling is that instead of a fixed number of sampling steps, we can allocate adaptive compute per sample by stopping when the gradient norm drops below a threshold. We present the distribution of total sampling steps for 1024 samples when using adaptive compute. Adaptive compute lowers the number of function evaluations by 60% (original sampling uses 250 steps), offering promising evidence that EqM can enable new inference-time improvements.
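The stopping rule can be sketched as below, reusing the gradient-descent sampler assumed earlier; the threshold tau, step size, and step cap are illustrative values rather than the paper's settings.

import torch

@torch.no_grad()
def eqm_sample_adaptive(model, shape, eta=0.01, tau=0.05, max_steps=250, device="cuda"):
    """Stop descending once the gradient norm falls below a threshold."""
    x = torch.randn(shape, device=device)
    for step in range(max_steps):
        g = model(x)
        # Batch-averaged criterion for brevity; per-sample stopping is analogous.
        if g.flatten(1).norm(dim=1).mean() < tau:
            break
        x = x - eta * g
    return x, step + 1                      # also report the compute actually used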


Total Steps Under Adaptive Compute. EqM assigns different numbers of steps for each sample, adaptively adjusting compute at inference time.


EqM naturally supports integration-based samplers. ODE-based diffusion samplers can be viewed as a special case of our gradient-based method by treating the velocity as a descent direction. We provide a systematic comparison between traditional integration samplers and our proposed gradient-descent samplers on EqM-XL/2. The Euler ODE sampler, plain gradient descent, and NAG-GD all exceed the Flow Matching baseline by a large margin, with NAG-GD performing the best.
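To make that correspondence concrete, a sketch under the same assumptions as above: an Euler step that treats the negated learned gradient as the velocity is identical to a gradient-descent step of the same size.

import torch

@torch.no_grad()
def euler_step(model, x, dt):
    """One Euler ODE step, with the velocity taken as the negative learned gradient."""
    v = -model(x)          # treat the velocity as the descent direction
    return x + dt * v      # identical to the gradient-descent update x - dt * model(x)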







Sampler Comparison. EqM exceeds Flow Matching in performance (measured by FID) using both integration-based ODE sampler and gradient-based samplers.


Unique Properties

EqM demonstrates unique properties that traditional diffusion/flow-based models lack. By learning equilibrium dynamics, EqM can directly take in and denoise a partially noised image. Existing diffusion/flow models require an explicit noise level as input to process partially noised images, but our EqM model has no such limitation. As shown below, EqM can generate high-quality samples directly from partially noised inputs, whereas flow-based models struggle and their outputs remain noisy throughout sampling.
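Because the model takes no time or noise-level input, a partially noised image can be fed directly into the same descent loop. A sketch under the assumptions above; the function name, the blend in the comment, and the hyperparameters are illustrative.

import torch

@torch.no_grad()
def eqm_denoise(model, x_partial, steps=100, eta=0.01):
    """Denoise a partially noised image: the same loop, just a different starting point."""
    x = x_partial.clone()       # e.g. 0.5 * image + 0.5 * noise; the noise level is never given to the model
    for _ in range(steps):
        x = x - eta * model(x)  # no noise level or timestep needs to be specified
    return x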


Partially Noised Image Generation. EqM successfully generates realistic reconstructions, while Flow Matching fails and its output remains noise.


EqM also naturally supports the composition of multiple models by adding their energy landscapes together. We test composition by combining models conditioned on different ImageNet class labels, adding the two conditional gradients together as the update gradient at each sampling step. Our results below demonstrate that EqM is easily composable by optimizing the summed gradient. This mirrors the composability of EBMs, whereas composing diffusion models accurately is significantly more complex.
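A sketch of composed sampling, assuming a class-conditional interface model(x, y); the two labels, the equal weighting, and the hyperparameters are illustrative.

import torch

@torch.no_grad()
def eqm_compose(model, y_a, y_b, shape, steps=250, eta=0.01, device="cuda"):
    """Compose two class conditions by summing their gradients at every step."""
    x = torch.randn(shape, device=device)
    for _ in range(steps):
        g = model(x, y_a) + model(x, y_b)   # summed gradients = summed energy landscapes
        x = x - eta * g
    return x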


Image Composition. We can compose different class-conditional EqM models by directly adding their gradients together, in a manner similar to EBMs.



Another unique property of the EqM model is its inherent ability to perform out-of-distribution (OOD) detection using its energy values. We report the area under the ROC curve (AUROC). Compared with the baselines, EqM provides reasonable OOD detection across all tested datasets and achieves the best overall performance, suggesting that Equilibrium Matching indeed learns a valid energy landscape.
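A sketch of the evaluation, assuming access to a scalar per-sample energy energy_fn (e.g. from an explicit energy parameterization of the model; for a purely implicit variant, a surrogate score such as the gradient norm could be substituted). Higher energy is treated as more out-of-distribution.

import torch
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def ood_auroc(energy_fn, in_dist_loader, ood_loader, device="cuda"):
    """Compute AUROC with the per-sample energy as the OOD score."""
    scores, labels = [], []
    for loader, is_ood in ((in_dist_loader, 0), (ood_loader, 1)):
        for x, _ in loader:
            e = energy_fn(x.to(device))     # shape (B,): one energy per sample
            scores += e.cpu().tolist()
            labels += [is_ood] * x.shape[0]
    return roc_auc_score(labels, scores)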

OOD Detection. EqM achieves reasonable AUROC on all tested OOD datasets and has the best average result among all tested models.


BibTeX

@article{wang2025equilibrium,
  title        = {Equilibrium Matching: Generative Modeling with Implicit Energy-Based Models},
  author       = {Wang, Runqian and Du, Yilun},
  journal      = {arXiv preprint arXiv:2510.02300},
  year         = {2025},
}