Leveraging Discrete Function Decomposability for Scientific Design

JC Bowden, S Levine, J Listgarten
International Conference on Learning Representations (ICLR), 2026
[paper] [code] [email me] [copy bibtex]

Written by James Bowden, with thanks to Jenn and Sergey, Cat Glossop, Aileen Zhang, and Shreshth Srivastava for feedback.

Choose your own adventure (pick the one best suited to your expertise):

This blog post will focus on protein design for concreteness. Optional details are collapsed; without these, it should be a ~5 min read.


[optional] How can we formalize protein sequence design as an optimization problem?
Though our method can be used for any optimization problem over a discrete design space, as a concrete example, let's consider only the problem of designing a protein sequence. We can set this up as follows:
  1. The kind of discrete object that we'd like to design, \(x\), is a protein sequence of length \(L\) in which each design variable (or "position"), \(x_i\), is an amino acid from an alphabet, \(A_i\). For simplicity, we'll assume that every position uses the same standard amino acid alphabet of size 20, \(A\).
  2. Given \(L\) and \(A\), we can write the design space as \(X := A^L\), i.e., all amino acid sequences of length \(L\). It follows that there are \(\lvert A\rvert^L=20^L\) possible proteins to choose from.
  3. Specify a property function to design toward, \(f(x)\), such as binding affinity to a target, or gene editing efficiency. In practice, this may be a predictive model fit on limited assay-labeled data.
  4. Putting this all together, our design problem is to find a sequence that maximizes our specification: \(x^{\ast}=\arg\max_{x\in X} f(x)\).
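To make the setup concrete, here is a minimal brute-force sketch on a deliberately tiny design space: a made-up 4-letter alphabet and a toy property function standing in for a real \(f\). At realistic sizes, \(20^L\) makes this enumeration hopeless, which is exactly why we need smarter search.

```python
from itertools import product

# Toy setup: a 4-letter "amino acid" alphabet and length-3 sequences,
# so |X| = 4**3 = 64 candidate designs (vs. 20**L in the real problem).
A = "ACDE"
L = 3

def f(x):
    # Made-up property function: rewards 'C' residues, penalizes adjacent
    # repeats. A real f would be an assay or a trained predictive model.
    return x.count("C") - sum(x[i] == x[i + 1] for i in range(len(x) - 1))

X = ["".join(s) for s in product(A, repeat=L)]   # the full design space A^L
x_star = max(X, key=f)                           # x* = argmax_{x in X} f(x)
print(len(X), x_star, f(x_star))                 # 64 CAC 2
```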
[optional] How are discrete optimization problems solved with standard distributional optimization?
Distributional optimization is a way of solving design problems that can be written as \(x^{\ast}=\arg\max_{x\in X} f(x)\); estimation of distribution algorithms (EDAs) and policy optimization in reinforcement learning are two common instantiations. Rather than naively evaluating one protein after another until all of \(X\) has been considered, distributional optimization algorithms navigate the design space using a probability distribution, \(p_\theta(x)\), often referred to as a "search distribution" or a "policy". Intuitively, the search distribution is like a spotlight that moves through the design space toward regions where \(f(x)\) is larger (schematic: bottom left). Nowadays, \(p_\theta(x)\) is typically parameterized as a highly expressive neural generative model, such as an autoregressive or diffusion model, allowing for almost arbitrarily shaped spotlights.
Standard EDA pseudocode
  1. Initialize \(p_\theta(x)\)
  2. for \(N\) training iterations do
  3.     Sample \(K\) designs, \(\{x^1, \ldots, x^K\} \sim p_\theta(x)\)
  4.     Compute a weight for each sample, \(w^k=f(x^k)\)
  5.     Update \(p_\theta(x)\) via weighted maximum likelihood:
  6.         \(\theta \leftarrow \arg\max_\theta \mathbb{E}_{\{x^k\}}[w^k \log p_\theta(x^k)]\)
  7. end for
  8. Sample from \(p_\theta(x)\) up to your experimental budget and test in the lab!
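As a minimal illustration of the pseudocode above (not the expressive neural \(p_\theta\) discussed in the paper), here is an EDA sketch where the search distribution is an independent per-position categorical and \(f\) is a toy matching score; for independent categoricals, the weighted maximum likelihood update has a closed form, namely weighted empirical frequencies. The weight shift to nonnegative values and the 0.5 mixing rate are illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
L, A = 8, 4                    # toy problem: length-8 sequences, 4-letter alphabet
target = rng.integers(A, size=L)

def f(X):
    # Toy property: number of positions matching a hidden target sequence.
    return (X == target).sum(axis=1).astype(float)

# Search distribution p_theta: one independent categorical per position (L x A).
theta = np.full((L, A), 1.0 / A)

for _ in range(50):                            # N training iterations
    K = 100
    # Sample K designs from p_theta.
    X = np.stack([rng.choice(A, size=K, p=theta[i]) for i in range(L)], axis=1)
    w = f(X)
    w = (w - w.min() + 1e-9)                   # shift so weights are nonnegative
    w = w / w.sum()
    # Weighted maximum likelihood update: weighted empirical frequencies
    # (with a little smoothing), mixed with the previous parameters.
    counts = np.zeros((L, A))
    for i in range(L):
        np.add.at(counts[i], X[:, i], w)
    freq = (counts + 1e-6) / (counts.sum(axis=1, keepdims=True) + A * 1e-6)
    theta = 0.5 * theta + 0.5 * freq

best = theta.argmax(axis=1)
print((best == target).mean())                 # fraction of positions recovered
```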
There's much more discussion of EDAs, their derivation, relevant hyperparameters, and the important ways they can be extended in our paper.

Decomposing the design space

A crux of many difficult design problems is that there are a huge number of possible designs, far too many to consider them all. Imagine, however, that a (particularly simple) decomposition held, yielding two completely separate, smaller design spaces: one for the interface amino acids (those that contact a binding partner) and one for the scaffold amino acids (schematic: top right). This decomposition corresponds to a linear additive form of the property function to be maximized, \(f(x_i, x_s) = f_i(x_i) + f_s(x_s)\), from which we can see that this is actually two separate optimization problems. We don't expect such clean-cut decomposability in most problems. To allow more complex forms of decomposability, we represent a decomposition as a graph in which nodes are design variables and edges denote coupling with respect to \(f(x)\); the simple example corresponds to two disconnected components. Our core contribution is a search algorithm that can leverage arbitrarily complex decomposition graphs.
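That two-piece additive case can be checked numerically in a few lines. With made-up interface and scaffold scoring functions over a 4-letter alphabet, the joint argmax over the full space coincides with the concatenation of the two separate argmaxes, at a fraction of the evaluations.

```python
from itertools import product

# Toy "protein": 2 interface positions followed by 2 scaffold positions.
A = "ACDE"

def f_i(x):   # made-up interface term: prefers 'C'
    return x.count("C")

def f_s(x):   # made-up scaffold term: prefers 'E'
    return x.count("E")

# Joint search over the full space: |A|^4 = 256 evaluations.
joint = max(("".join(s) for s in product(A, repeat=4)),
            key=lambda x: f_i(x[:2]) + f_s(x[2:]))

# Decomposed search: 2 * |A|^2 = 32 evaluations, same answer.
xi = max(("".join(s) for s in product(A, repeat=2)), key=f_i)
xs = max(("".join(s) for s in product(A, repeat=2)), key=f_s)

print(joint, xi + xs)   # identical maximizers: CCEE CCEE
```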

What do realistic decompositions look like? Let’s look at decomposition graphs derived from two proteins, AAV and CreiLOV.

In the first figure, we show one way to obtain a decomposition graph for a protein design problem—by computing a contact graph.
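As a sketch of what that computation might look like, with made-up coordinates and an assumed distance cutoff (a real pipeline would use residue coordinates from a PDB structure, e.g. Cα–Cα distances):

```python
import numpy as np

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(12, 3))   # fake 3D coordinates for 12 residues
cutoff = 4.0                                # assumed contact cutoff (angstroms)

# Pairwise Euclidean distances between residues.
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))

# Contact graph as an adjacency matrix: edge (i, j) iff residues i and j are
# close in 3D (excluding self-contacts). Edges denote coupling w.r.t. f(x).
adj = (dist < cutoff) & ~np.eye(len(coords), dtype=bool)
edges = [(i, j) for i in range(len(coords))
         for j in range(i + 1, len(coords)) if adj[i, j]]
print(len(edges))
```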

Figure: 3D structures. (a) AAV. (b) CreiLOV.

Notice that the decomposition graph for AAV (right) has few edges and is relatively chain-like. This suggests that we will be able to realize a large efficiency gain by operating in its decomposed design space. CreiLOV, on the other hand, looks much closer to a fully-connected graph, so we can't expect to improve over a naive optimization method that considers all variables jointly. One could always choose a more sparsely-connected decomposition, but it would describe the couplings in \(f\) less faithfully; this hints at a key tradeoff in practice between search efficiency and accuracy of the decomposition.

Briefly, we can easily convert any undirected graph into a directed junction tree, which has no cycles, allowing DADO to draw inspiration from exact message-passing (next section). The corresponding junction tree for each protein is shown below.
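For intuition, here is a self-contained sketch of the textbook junction-tree construction on a tiny, already-triangulated toy graph: enumerate the maximal cliques, then build a maximum-weight spanning tree over them where edge weights are separator (clique-intersection) sizes. This is the standard recipe, not the paper's exact pipeline.

```python
from itertools import combinations

# Toy chordal coupling graph on 5 design variables (assumed edges).
nodes = range(5)
edges = {(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)}

def is_clique(s):
    return all(tuple(sorted(p)) in edges for p in combinations(s, 2))

# Brute-force maximal cliques (fine for tiny graphs; real code would run
# e.g. Bron-Kerbosch on a triangulated graph).
cliques = [set(s) for r in range(1, 6) for s in combinations(nodes, r) if is_clique(s)]
maximal = [c for c in cliques if not any(c < d for d in cliques)]

# Junction tree: maximum-weight spanning tree over the maximal cliques,
# where a candidate edge's weight is the size of the clique intersection.
cand = sorted(((len(a & b), i, j)
               for (i, a), (j, b) in combinations(enumerate(maximal), 2)),
              reverse=True)
parent = list(range(len(maximal)))        # tiny union-find for Kruskal
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i
tree = []
for w, i, j in cand:
    if w > 0 and find(i) != find(j):
        parent[find(i)] = find(j)
        tree.append((maximal[i], maximal[j]))
print(maximal, tree)
```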

DADO: Decomposition-Aware Distributional Optimization

To infuse distributional optimization with knowledge of a decomposition graph, our method has two core components:

First, we perform search with a generative model, \(p_\theta(x)\), factorized according to the decomposition junction tree. Each factor distribution searches a subset of design variables corresponding to a node in the tree. This factorization makes it so that DADO only "sees" the smaller decomposed space¹, whereas the standard EDA searches all dimensions of \(x\) together.
Second, we coordinate the factor distributions via graph message-passing so that they can each be trained separately. The messages are called value functions and denoted \(Q_i(\tilde{x}_i, \tilde{x}_p)\). Each \(Q_i(\tilde{x}_i, \tilde{x}_p)\) describes the partial value of \(f\) on the subtree rooted at node \(i\), in expectation over its descendants' search distributions. We can use the value functions to perform a weighted maximum likelihood update to each factor distribution separately, which makes DADO more statistically efficient than the standard EDA. Below, we show a schematic comparison of a standard EDA to DADO (panel b) for a particular tree decomposition (panel a); on synthetic problems DADO consistently finds higher-\(f(x)\) designs (panel c).
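DADO's learned value functions are beyond a blog snippet, but the exact message passing they draw inspiration from is easy to show. Below is a max-sum (Viterbi-style) pass on a toy chain-decomposable objective \(f(x) = \sum_i g_i(x_i, x_{i+1})\): the backward messages play the role of value functions for each tail of the chain, and decoding the optimum costs \(O(L\lvert A\rvert^2)\) instead of \(O(\lvert A\rvert^L)\).

```python
import numpy as np

rng = np.random.default_rng(2)
L, A = 6, 4                      # chain of 6 variables, alphabet size 4
# Chain-decomposable toy objective: f(x) = sum_i g[i][x_i, x_{i+1}].
g = rng.normal(size=(L - 1, A, A))

# Backward max-sum messages: m[i][a] is the best achievable value of the
# sub-chain i..L-1 given x_i = a (a "value function" for each tail).
m = np.zeros((L, A))
for i in range(L - 2, -1, -1):
    m[i] = (g[i] + m[i + 1][None, :]).max(axis=1)

# Forward decode the argmax sequence in O(L * A^2) instead of O(A^L).
x = [int(m[0].argmax())]
for i in range(L - 1):
    x.append(int((g[i][x[-1]] + m[i + 1]).argmax()))

best = sum(g[i][x[i], x[i + 1]] for i in range(L - 1))
print(x, best, m[0].max())       # best equals the optimal value m[0].max()
```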

DADO schematic

DADO’s optimization efficiency gain holds up for messier, real-world design problems too. Recall the two protein design problems we introduced earlier, AAV and CreiLOV, and their contrasting decomposition graphs (sparse vs. dense). Below², we observe that DADO finds much better designs than decomposition-unaware methods for AAV design, whereas for CreiLOV, knowledge of the decomposition doesn’t help, as one would expect.

Figure: optimization results. (a) AAV. (b) CreiLOV.

What now?

  • Finding an accurate decomposition for a design problem is not always straightforward. The real world is often structured though, and even very approximate decompositions can be useful. One might try to infer decomposability from labeled data or auxiliary information, or some other creative scheme. This is an exciting research direction both for proteins and scientific design in general.
  • For concreteness, here are some examples of scientific design problems with modularity that might lend itself to decomposition.
    • Modularity abounds in natural systems.
    • Modularity abounds in man-made systems.
  • There's no reason why DADO can't be used for optimization in continuous design spaces; we simply didn't investigate it in our paper. Everything should extend straightforwardly.

We hope you’ll read (and enjoy) our paper! If you’d like, you can return to the top and re-read from a different perspective :)
Feel free to email me with any questions, comments or feedback. I’d also be excited to discuss applying our method to your problem, or potential collaboration.


Footnotes
  1. In our paper, we include a baseline—a modernized version of the factorized distribution algorithm, or FDA—that only uses a factorization of the search distribution without the message-passing coordination. In FDA, the factorized search distribution is updated the same way as the standard EDA, with a per-sample weight, \(f(x)\), instead of a per-node weight. It's interesting that for a few problems, FDA performs as well as or better than DADO, despite its search distribution update being less statistically efficient. We suspect this is due to an additional source of variance—the sample-based approximation of DADO's value functions—which can outweigh the benefit of a per-node update. We only observed this when the junction tree nodes were relatively large, which is exactly when estimating a value function from finite samples is most difficult. It would be interesting to more carefully characterize this behavior, and one might adapt variance-reduction techniques from RL (like learned value functions) here.
  2. The y-axes of the two plots are not comparable, as \(f(x)\) is protein-specific: for AAV, \(f\) represents viral viability (whether the capsid packages), and for CreiLOV, fluorescence intensity. The upshot is that DADO can find better protein sequences than decomposition-unaware methods for AAV packaging, but not for CreiLOV fluorescence; one should not conclude that any method performs better on one protein than the other, since \(f\) is not comparable between them.
