Which RFdiffusion knobs actually matter — and what happens when you turn them?
Previously, in Part 1: we built a complete de novo binder design pipeline (RFdiffusion → ProteinMPNN → ESMFold) targeting the PD-L1 immune checkpoint, and showed that it can produce structurally validated minibinder candidates from scratch. Read the full Part 1 writeup →
That first run answered the question "can I get this to work?" This post asks the harder question: "now that it works, what do I need to pay attention to?" The goal here is not to draw definitive conclusions — at 20 backbones per condition, we don't have the statistical power for that. Instead, it's to set up a framework for answering that question. In a production setting with more compute, this same framework at higher sample sizes should yield actionable guidance for computational design campaigns. This study consumed approximately 20 GPU-hours on a T4 instance.
The workflow is the same five-step pipeline as in Part 1. RFdiffusion generates protein backbone conformations de novo, guided by hotspot residues on the target surface. Backbones that pass a geometry filter (no clashes, reasonable inter-chain distances) are passed to ProteinMPNN, which designs an amino acid sequence for each backbone. ESMFold then predicts the 3D structure of that sequence independently, from sequence alone, and we measure how closely that prediction matches the original designed backbone.
In Part 1, we used a single set of parameters — a 70-residue binder, a hydrophobic hotspot cluster, no added noise. It worked. But that single data point leaves fundamental questions unanswered: was that a lucky configuration, or a robust one? Would shorter binders be equally valid? Does the chemical character of the hotspot residues matter?
There's a pragmatic question that anyone running a computational design campaign eventually has to face, whether they come from enzyme engineering, antibody discovery, or assay development: "How do I trust that the designs I am about to test are worth testing?" Equivalently: "Is my computational workflow reliable, or is it producing noisy designs?" To answer these questions, you have to know how good each tool in your computational design toolkit is and when it should be used.
This kind of instrument characterization is something experimentalists do reflexively: you run controls, you vary one thing at a time, you build intuition about the system's behavior. This study applies the same logic to a computational tool.
Three parameters were varied systematically:
Binder length (25, 50, 70, 100 residues) — Controls topological complexity. The RFdiffusion README notes that the default model tends to generate mostly helical binders.[2] At what length do we start seeing topological diversity beyond helix-turn-helix?
Hotspot configuration — RFdiffusion uses "hotspot residues" on the target surface to guide where the binder docks.[2] We defined three configurations based on our PD-L1 interface analysis (PDB 4ZQK):
Cluster A (4 residues, N-terminal end): Ala18, Thr20, Gly120, Asp122 — small and hydrophobic/polar residues at one end of the interface.
Cluster B (5 residues, central/C-terminal): Asp26, Tyr56, Arg113, Tyr123, Arg125 — aromatic and charged residues that dominate PD-1 binding energetics.
Distributed (8 residues, spanning both): Ala18, Thr20, Asp26, Tyr56, Arg113, Gly120, Asp122, Arg125 — a sparser selection spanning both clusters.
The question is whether the chemical character of these residues — not just their count or spatial arrangement — affects design quality.
Noise scale (0.0 vs 0.5) — Controls stochasticity in the diffusion process. Higher noise should produce more diverse but potentially lower-quality backbones.
The full factorial design is 4 lengths × 3 hotspot configurations × 2 noise levels = 24 conditions. Only 16 of those conditions produced geometry-passing backbones. The 8 missing conditions, mostly short binders (25-mers) with Cluster B or distributed hotspots, yielded no designs that survived initial geometry filtering.
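The factorial grid described above can be enumerated in a few lines. This is a minimal sketch, not the script used for the study: the residue numbers, lengths, and noise levels come from the post, while the chain ID `A`, the condition naming scheme, and the `build_conditions` helper are assumptions made for illustration. The override names follow the RFdiffusion README as I understand it and should be checked against the version you run. The target segment of the contig string is deliberately left out because the post does not state it.

```python
from itertools import product

# Hotspot residue numbers as listed in the post (PD-L1, PDB 4ZQK).
HOTSPOTS = {
    "cluster_a": [18, 20, 120, 122],
    "cluster_b": [26, 56, 113, 123, 125],
    "distributed": [18, 20, 26, 56, 113, 120, 122, 125],
}
LENGTHS = [25, 50, 70, 100]
NOISE_SCALES = [0.0, 0.5]
TARGET_CHAIN = "A"  # assumption: chain ID of PD-L1 in the input PDB


def build_conditions(n_designs=20):
    """Enumerate every length x hotspot x noise combination."""
    conditions = []
    for length, (name, residues), noise in product(
        LENGTHS, HOTSPOTS.items(), NOISE_SCALES
    ):
        hotspot_arg = ",".join(f"{TARGET_CHAIN}{r}" for r in residues)
        conditions.append({
            "name": f"L{length}_{name}_n{noise}",  # hypothetical naming scheme
            "length": length,
            "hotspot": name,
            "noise": noise,
            # Binder part of the contig only; the target segment must be
            # prepended by the caller (not specified in the post).
            "binder_contig": f"{length}-{length}",
            # Hydra-style overrides; verify names against your RFdiffusion version.
            "overrides": [
                f"ppi.hotspot_res=[{hotspot_arg}]",
                f"inference.num_designs={n_designs}",
                f"denoiser.noise_scale_ca={noise}",
                f"denoiser.noise_scale_frame={noise}",
            ],
        })
    return conditions


conditions = build_conditions()
print(len(conditions), "conditions,", 20 * len(conditions), "backbones")
# 24 conditions, 480 backbones
```

Generating the grid programmatically also makes the missing conditions easy to spot later: any condition name with no geometry-passing output is one of the 8 gaps.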
Conditions tested — Each condition is a unique combination of binder length, hotspot configuration, and noise scale. Each condition runs RFdiffusion once to produce 20 backbone conformations.
Designs validated — The subset of RFdiffusion backbones that passed initial geometry filtering (no steric clashes, reasonable backbone distances) and were subsequently processed through ProteinMPNN sequence design and ESMFold structure prediction.
Pass final filter — Designs that met both validation thresholds: pLDDT ≥ 80 (ESMFold is confident the sequence folds as intended) and RMSD ≤ 2.0 Å (the predicted structure closely matches the designed backbone).
Pass rate — Designs passed ÷ designs validated. A measure of how efficiently a given condition converts raw backbones into structurally validated candidates.
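The definitions above translate directly into code. The sketch below assumes each validated design carries a mean pLDDT on a 0 to 100 scale and a backbone RMSD in Å; the `Design` class and the toy values are hypothetical and are not results from this study.

```python
from dataclasses import dataclass

PLDDT_MIN = 80.0  # threshold from the post
RMSD_MAX = 2.0    # Å, threshold from the post


@dataclass
class Design:
    name: str
    plddt: float  # mean ESMFold pLDDT, 0-100 scale (assumed)
    rmsd: float   # Å, predicted structure vs. designed backbone


def passes_final_filter(d: Design) -> bool:
    """Both thresholds are inclusive, matching pLDDT >= 80 and RMSD <= 2.0 Å."""
    return d.plddt >= PLDDT_MIN and d.rmsd <= RMSD_MAX


def pass_rate(designs):
    """Designs passed / designs validated; None when nothing was validated."""
    if not designs:
        return None
    return sum(passes_final_filter(d) for d in designs) / len(designs)


# Illustrative values only, not data from this study.
toy = [
    Design("toy_1", plddt=91.2, rmsd=0.8),
    Design("toy_2", plddt=84.0, rmsd=2.6),  # fails RMSD
    Design("toy_3", plddt=72.5, rmsd=1.1),  # fails pLDDT
    Design("toy_4", plddt=80.0, rmsd=2.0),  # passes, thresholds are inclusive
]
print(pass_rate(toy))  # 0.5
```

Returning `None` for an empty condition keeps "no designs validated" distinct from "0% passed", which matters for the 8 conditions that produced no geometry-passing backbones.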
Before looking at individual parameters, it helps to see the overall attrition. Not every RFdiffusion backbone produces a viable candidate, and the funnel narrows at each step. In this run, 480 backbones (24 conditions × 20 backbones) yielded 55 designs that reached validation, of which 31 passed the final filter.
This attrition, from 480 backbones to 31 candidates passing the final filter, is consistent with published pipelines. Bennett et al. found that structure-prediction-based filtering (using models like AlphaFold2 or ESMFold to check whether a designed sequence refolds correctly) increases experimental success rates nearly 10-fold compared to older physics-based scoring, but the computational pass rate still varies widely by target.[1] The RFdiffusion paper itself generated ~10,000 backbones per target in production campaigns.[2]
This study uses 20 backbones per condition — far fewer than a production campaign. The Baker Lab recommends ~1,000–10,000 backbones per target.[2] Our goal here is not to identify the optimal condition but to observe directional trends: which parameters appear to shift outcomes, and which seem forgiving. All observations below should be read as "in this run, we observed..." rather than definitive conclusions.
Of the 24 conditions in the full factorial design, only 16 produced geometry-passing backbones. The 8 that produced none were mostly short binders (25-mers) paired with Cluster B or distributed hotspots.
That's already a result: given these inputs, short binders with many spatially distributed hotspot constraints appear to be physically overconstrained. A plausible reading is that the model cannot satisfy all the geometric requirements simultaneously within 25 residues. This is consistent with the intuition that hotspot residue selection should consider the spatial extent of the binding interface relative to the binder length.[3]
| Condition | Length | Hotspot | Noise | Designs | Pass | Rate |
|---|---|---|---|---|---|---|
For each of the three parameters, we can ask: does varying this parameter shift the pass rate, pLDDT, or RMSD distributions in a consistent direction?
Each bubble is one design. Color indicates hotspot configuration (blue = Cluster A, red = Cluster B, amber = Distributed). Bubble size scales with binder length — the smallest bubbles are 25-mers, the largest are 100-mers. Faded bubbles failed one or both filters. Dashed lines show filter cutoffs (pLDDT ≥ 80, RMSD ≤ 2.0 Å). Designs in the top-left quadrant pass both filters.
Each cell shows pass / total designs for that condition. Grey cells indicate conditions where no backbones passed initial geometry filtering.
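A pass / total grid like the one described can be built from per-design records. This is a sketch under assumptions: the record layout `(length, hotspot, noise, passed)` and the `summarize` helper are hypothetical, and the toy records are illustrative rather than data from this study. Cells with no validated designs come back as `None`, which corresponds to the grey cells.

```python
from collections import defaultdict

LENGTHS = [25, 50, 70, 100]
HOTSPOTS = ["cluster_a", "cluster_b", "distributed"]


def summarize(records, noise):
    """Return {(length, hotspot): 'pass/total' or None} for one noise level.

    `records` holds one (length, hotspot, noise, passed) tuple per
    validated design.
    """
    counts = defaultdict(lambda: [0, 0])  # [passed, total]
    for length, hotspot, n, passed in records:
        if n == noise:
            counts[(length, hotspot)][0] += int(passed)
            counts[(length, hotspot)][1] += 1

    grid = {}
    for length in LENGTHS:
        for hotspot in HOTSPOTS:
            key = (length, hotspot)
            if key in counts:
                grid[key] = f"{counts[key][0]}/{counts[key][1]}"
            else:
                grid[key] = None  # grey cell: nothing reached validation
    return grid


# Illustrative records only, not data from this study.
records = [
    (70, "cluster_a", 0.5, True),
    (70, "cluster_a", 0.5, True),
    (70, "cluster_a", 0.5, False),
    (100, "cluster_b", 0.5, True),
]
grid = summarize(records, noise=0.5)
print(grid[(70, "cluster_a")])    # 2/3
print(grid[(25, "distributed")])  # None
```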
Several directional trends emerged consistently enough to be informative:
Cluster B produced the highest in silico pass rate in this run. The aromatic/charged hotspot patch yielded 7/8 passing designs (88%), pooled across binder lengths. This is consistent with the expectation that aromatic and charged residues create stronger, more geometrically specific binding constraints for RFdiffusion to optimize against. However, Cluster B also had the fewest designs reaching the validation stage, so this rate rests on only 8 designs and carries little statistical weight.
70-mers appeared to be a productive length. At 70 residues, 14/22 designs (64%) passed final validation — the best pass rate among lengths with substantial sample sizes. 25-mers had 2/2 (100%) but only 2 designs survived geometry filtering at all. 100-mers dropped to 11/22 (50%). This is qualitatively consistent with the observation that longer backbones have more conformational degrees of freedom, making ESMFold recapitulation harder — a phenomenon noted across multiple binder design campaigns.[1]
Moderate noise appeared to help, not hurt. At noise scale 0.5, 9/11 designs (82%) passed validation versus 22/44 (50%) at noise 0.0. One tentative reading is that stochasticity in the diffusion process pushes the model toward more diverse and ultimately more designable backbone conformations. However, the comparison is confounded by survivorship: noise 0.5 conditions produced far fewer geometry-passing backbones in the first place (11 versus 44), so only the "survivors" reach validation.
The 8 missing conditions are informative in themselves. Most absent conditions are short binders (25-mers) paired with distributed or Cluster B hotspots. When the spatial extent of the hotspot set exceeds what the binder backbone can physically reach, the model produces geometrically invalid outputs. This is the computational equivalent of "the protein can't fold that way" — and it aligns with published guidance on matching hotspot selection to binder length.[3]
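To make the caveat about small samples concrete, one option is to attach a confidence interval to each pass rate quoted above. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the original analysis reports. The counts are taken from the post, and `wilson_interval` is a hypothetical helper.

```python
from math import sqrt


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total == 0:
        raise ValueError("no validated designs")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Counts quoted in the post.
observed = {
    "Cluster B": (7, 8),
    "70-mers": (14, 22),
    "noise 0.5": (9, 11),
    "noise 0.0": (22, 44),
}
for label, (k, n) in observed.items():
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.0%}, 95% CI {lo:.0%} to {hi:.0%}")
```

With counts this small the intervals span tens of percentage points, and the intervals for the two noise levels overlap, which is in line with reading these trends as directional rather than conclusive. The interval does not correct for the survivorship effect noted above; it only describes sampling uncertainty among designs that reached validation.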
Ranked by RMSD among designs passing both filters (pLDDT ≥ 80, RMSD ≤ 2.0 Å). The top candidate — a 70-mer from the Cluster A / noise 0.5 condition — achieved an RMSD of 0.24 Å, meaning ESMFold reproduced the designed backbone almost exactly.
| # | Condition | pLDDT | RMSD (Å) | MPNN | Rg (Å) | Contacts | Status |
|---|---|---|---|---|---|---|---|
None of the current computational metrics (pLDDT, RMSD, pAE, i_pTM) are predictive of binding affinity. They are useful as binary classifiers — "does this design fold as intended?" — but a high-scoring in silico design may still fail experimentally.[4][5]
It's worth placing these results in the context of the rapidly evolving binder design field. The RFdiffusion + ProteinMPNN pipeline we used here remains the most widely adopted approach,[6] but newer tools like BindCraft — which uses AlphaFold2 backpropagation for iterative sequence–structure co-optimization — have reported substantially higher experimental success rates.[4] RFdiffusion3, released in late 2025, extends the model to atom-level hotspot specification and all-atom generation.[7] The fundamental workflow — generate, filter, validate — is shared across all of them. Understanding which parameters matter in the generation step is relevant regardless of which tool you use.
The parameter study tells us which conditions produce structurally valid designs. The next page examines the designs themselves: their three-dimensional structures (viewable in an interactive 3D viewer), and how their predicted contacts compare to the native PD-1/PD-L1 interface through interaction fingerprinting.
→ Part 3: Structure & Interface Analysis