Independent Complex Validation | The wet-lab scientist's guide to computational protein design

Part 5: Independent Complex Assessment with Boltz-2

When two algorithms predict binder–PD-L1 complexes independently, do they agree on the interface — and what counts as agreement?

Previously, in Parts 1–4: We built a complete design pipeline — RFdiffusion → ProteinMPNN → ESMFold — and characterized 31 mini-binder candidates through a parameter sensitivity study and full-sidechain interaction fingerprinting. Part 4 ran molecular dynamics on three representative designs to test the simulation workflow. Every structural analysis so far has relied on geometry that originated from RFdiffusion. This page asks the harder question: is the binding interface encoded in the designed sequence itself, or only in the designed structure?

Confident complexes: 28/31
Pose-convergent: 14/31
Best ipTMBoltz: 0.963
Interface overlap (convergent): 0.69

What is Boltz-2 and why does it matter?

So far in this project, every structural analysis has relied on geometry that originated from RFdiffusion. The backbone was designed by RFdiffusion, the sequence was designed by ProteinMPNN on that backbone, ESMFold confirmed the sequence folds into that backbone shape, and the fingerprinting and MD analyses all started from the RFdiffusion-designed complex pose. Every step carried forward the structural assumptions of the first step.

That raises a question: if the binder sequence folds correctly on its own (passing the ESMFold check), does it also form a complex with PD-L1? Single-chain fold recapitulation does not guarantee multi-chain interaction. Any interaction that does form could still be transient, unstable, or non-specific, because the design procedure does not explicitly optimize for binding energy.

Of course, the die-hard experimentalist's go-to approach would be: "let's test it in the lab". That approach is fine for one design or a handful. But generative technology can produce an arbitrarily large number of designs, and at that scale, testing in the lab without prior vetting is neither cost-effective nor wise.

So, can we use other computational tools to independently assess whether these sequences encode plausible binder–target complexes? Turns out, we can.

Boltz-2 is an open-source biomolecular interaction model from MIT/Recursion (Passaro et al., 2025), building on Boltz-1 (Wohlwend et al., 2024). Boltz-1 was introduced as an open-source model approaching AlphaFold 3-level complex-structure prediction accuracy; Boltz-2 extends this family by jointly modeling biomolecular structures and affinities. We give it two protein sequences — binder and PD-L1 — with no structural hints. It predicts how they fold together and reports confidence metrics, particularly the interface predicted TM-score (ipTM).

In this study, we use Boltz-2 to independently predict binder–PD-L1 complexes from the ProteinMPNN-designed sequences alone. We then ask: how do these sequence-only complex predictions compare to the RFdiffusion-derived design poses?

Two paths to the same question

The schematic below shows how Boltz-2 fits into the pipeline. Path 1 (the blue track) is the design pipeline we've built across Parts 1–3 — structural information flows forward at every step, from RFdiffusion through ESMFold filtering and interaction fingerprinting. Path 2 (the amber fork) branches from ProteinMPNN, taking only the designed sequences with no structural context, and asks Boltz-2 to predict the complex independently. The two paths converge after fingerprinting: do the structurally-derived poses and the sequence-only predictions agree on the interface?

[Pipeline schematic: two paths to complex validation. Design brief: create novel mini-protein binders for PD-L1. Shared start: PD-L1 surface (4ZQK chain A) → RFdiffusion (novel backbone) → ProteinMPNN (sequence design). Path 1, structural information carried forward: ESMFold (fold check) → filter and rank (31 candidates) → fingerprinting (interface contacts). Path 2, sequence to structure from scratch: sequences only (no structural hints) → Boltz-2 (independent prediction). Convergence: "Do they agree?" (compare interfaces) → 28 of 31 Boltz-supported → MD + affinity scoring (Parts 6–7, next).]

Figure 1. Two paths to complex validation. The design pipeline (Path 1, blue) carries structural information from RFdiffusion through ESMFold filtering and interaction fingerprinting. Boltz-2 (Path 2, amber) starts from ProteinMPNN sequences alone with no structural context. The two paths converge after fingerprinting: agreement provides orthogonal evidence that the binding interface is encoded in the designed sequence, not only in the designed structure. MD simulation and affinity scoring (dashed) follow downstream on the validated set.

How we ran this check

Step 1. Prepare inputs: pair each binder sequence with PD-L1 (no structures, just sequences).
Step 2. Run Boltz-2: predict the complex structure from sequence alone on a Colab T4 GPU.
Step 3. Extract confidence: ipTMBoltz, pLDDTBoltz, pTMBoltz, and PDEBoltz per design.
Step 4. Classify tiers: validated, uncertain, or failed, based on confidence thresholds.
Step 5. Compare poses: align PD-L1, then measure binder displacement and interface residue overlap.

Steps 1–4 answer: does an independent model agree that these are plausible complexes? Step 5 answers the harder question: does Boltz-2 find the same binding mode, and do the same residues mediate the interface? Agreement at the confidence level (Steps 1–4) is necessary but not sufficient — the real test is whether both algorithms converge on the same structural interface.
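As an illustration of Step 1, the sketch below builds a sequence-only input for one binder/PD-L1 pair. It assumes the YAML input schema described in the Boltz repository documentation (a `version` key plus a `sequences` list of `protein` entries, each with an `id` and a `sequence`); check it against the Boltz version you actually run. The function name `boltz_input_yaml` and the toy sequences are hypothetical placeholders, not the project's real inputs, and MSA settings are left out because this page does not say how they were configured.

```python
def boltz_input_yaml(target_seq: str, binder_seq: str) -> str:
    """Return YAML text for a two-chain, sequence-only complex prediction.

    Chain A is the PD-L1 target and chain B is the designed binder,
    matching the chain labels used on this page.
    Assumption: the schema follows the Boltz repository documentation.
    """
    return (
        "version: 1\n"
        "sequences:\n"
        "  - protein:\n"
        "      id: A\n"
        f"      sequence: {target_seq}\n"
        "  - protein:\n"
        "      id: B\n"
        f"      sequence: {binder_seq}\n"
    )


# Placeholder sequences for illustration only (not real PD-L1 or binder sequences).
toy_target = "ACDEFGHIKLMNPQRSTVWY"
toy_binder = "MKKLLEELLKK"

yaml_text = boltz_input_yaml(toy_target, toy_binder)
print(yaml_text)
```

Writing one such file per design keeps every prediction independent of the RFdiffusion pose, since no coordinates or templates appear anywhere in the input.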

What Boltz-2 measures

| Metric | What it means | Threshold |
|---|---|---|
| ipTMBoltz | Interface predicted TM-score: a model-internal confidence metric for the predicted inter-chain arrangement. High interface confidence is useful for triage, but should be interpreted as model confidence rather than binding affinity. | ≥ 0.6 |
| pLDDTBoltz | Per-residue confidence averaged across the complex. Indicates overall fold quality. | ≥ 0.7 |
| complex PDE | Predicted distance error across inter-chain residue pairs. Lower values indicate better interface geometry. | Lower is better |

What Boltz-2 is NOT used for here: Boltz-2 includes an affinity-prediction module, but we do not use it here. The released affinity workflow is intended for protein–ligand affinity prediction, not protein–protein affinity. This page uses only structure-confidence and interface-agreement metrics.

Results

Terminology. We use Boltz-supported for designs with ipTMBoltz ≥ 0.6, uncertain for designs where individual chains fold but interface confidence is low (ipTMBoltz < 0.6), and pose-convergent for designs with binder Cα RMSD < 5 Å after PD-L1 alignment between the RFdiffusion and Boltz-2 structures. Pose-divergent means RMSD ≥ 15 Å; designs between 5 and 15 Å are treated as intermediate.
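The terminology above amounts to two small decision rules, sketched here in Python. The ipTM, pLDDT, and RMSD cutoffs come from this page; treating a design with low ipTM and pLDDT below 0.7 as "failed" is an assumption, since the page does not spell out the failure rule. The function names are hypothetical.

```python
def confidence_tier(iptm: float, plddt: float,
                    iptm_min: float = 0.6, plddt_min: float = 0.7) -> str:
    """Classify a design by Boltz-2 confidence."""
    if iptm >= iptm_min:
        return "Boltz-supported"
    if plddt >= plddt_min:
        # Chains fold, but the interface confidence is low.
        return "uncertain"
    # Assumption: low interface confidence plus poor fold counts as failed.
    return "failed"


def pose_class(binder_rmsd: float) -> str:
    """Classify pose agreement from binder C-alpha RMSD (in angstroms)."""
    if binder_rmsd < 5.0:
        return "pose-convergent"
    if binder_rmsd >= 15.0:
        return "pose-divergent"
    return "intermediate"


# Values taken from the tables on this page.
print(confidence_tier(0.963, 0.956), pose_class(2.7))
print(confidence_tier(0.413, 0.922), pose_class(23.6))
```

Keeping the two rules separate mirrors the analysis: a design can be Boltz-supported and still pose-divergent.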

The data: side-by-side structure predictions

Before interpreting metrics, it helps to see what the two algorithms actually produce. Each card below shows one of the 31 designs. Toggle between the RFdiffusion-designed pose (Path 1) and the Boltz-2 predicted complex (Path 2). PD-L1 is shown in green, the binder in blue. Cards are sorted by ipTMBoltz and show both Boltz confidence metrics and pose-agreement metrics. Spend a moment toggling through a few designs — some predictions are strikingly similar, others clearly different. The analysis that follows quantifies what your eye is already seeing.

[Interactive structure viewer. Legend: PD-L1 target (chain A) in green; designed binder (chain B) in blue; cartoon representation.]

Layer 1: Boltz-2 confidence

28 of 31 designs produce confident Boltz-2 complexes. ipTMBoltz ≥ 0.6 for 28 designs, with the top-scoring design at 0.963. Three designs fell in the uncertain zone (ipTMBoltz 0.20–0.41). Zero designs failed outright. Importantly, the five highest-confidence designs are not merely high-ipTM predictions — they all show strong pose convergence with RMSD < 3 Å and paratope Jaccard > 0.6 (detailed in Layer 2 below).

Confidence by hotspot configuration

Cluster A designs scored highest on average (mean ipTMBoltz 0.821), followed by Cluster B (0.772) and Distributed (0.667). This is consistent with a plausible structural interpretation: concentrated hotspot clusters give Boltz a clearer docking signal because the binder has a well-defined binding patch, while distributed hotspots spread the interface thinner and make the docking geometry less deterministic for the prediction model.

[Plot: ipTMBoltz by hotspot configuration]

[Plot: ipTMBoltz by binder length]

Confidence by binder length

The 25-mer and 70-mer designs showed similar mean ipTMBoltz (~0.80), while 100-mers had the widest variance (min 0.199, max 0.941). Longer binders have more possible docking orientations, which may make Boltz's prediction task harder — but the best 100-mers scored nearly as high as the best 70-mers.

Figure 2. ipTMBoltz vs pLDDTBoltz for all 31 designs. The validated zone (upper right) contains 28 designs. The three uncertain designs have high pLDDTBoltz (the individual chains fold well) but lower ipTMBoltz (Boltz is less confident about the interface).

These confidence metrics tell us that Boltz-2 believes these sequences encode plausible binder–PD-L1 complexes. But confidence alone doesn't tell us whether Boltz found the same complex as RFdiffusion — only that it found a complex. The next two layers ask the harder question.

Layer 2: Does Boltz-2 recover the same binding pose?

Confidence tells us Boltz believes a complex exists. But does it find the same complex? To test this, we aligned PD-L1 between the RFdiffusion-designed threaded complex and the Boltz-2 predicted complex, then measured binder Cα RMSD after alignment. If both algorithms place the binder in the same position on PD-L1, this RMSD should be low.
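A minimal NumPy sketch of this measurement follows. It superimposes the predicted PD-L1 Cα atoms onto the reference PD-L1 with the Kabsch algorithm, applies the same transform to the predicted binder, and computes binder Cα RMSD without any further fitting. It assumes both structures list the same residues in the same order; the function names and the random toy coordinates are hypothetical, and the page does not state which software was used for the real analysis.

```python
import numpy as np


def kabsch(mobile: np.ndarray, ref: np.ndarray):
    """Return (R, t) so that mobile @ R.T + t best matches ref (least squares)."""
    mob_c, ref_c = mobile.mean(axis=0), ref.mean(axis=0)
    H = (mobile - mob_c).T @ (ref - ref_c)        # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref_c - R @ mob_c
    return R, t


def binder_pose_rmsd(target_ref, binder_ref, target_pred, binder_pred) -> float:
    """Binder C-alpha RMSD after superimposing the target chains only."""
    R, t = kabsch(target_pred, target_ref)
    moved = binder_pred @ R.T + t                 # binder rides along with the target fit
    return float(np.sqrt(((moved - binder_ref) ** 2).sum(axis=1).mean()))


# Toy check with random coordinates (not real structures).
rng = np.random.default_rng(0)
target = rng.normal(scale=10.0, size=(60, 3))
binder = rng.normal(scale=5.0, size=(30, 3)) + np.array([25.0, 0.0, 0.0])

theta = np.deg2rad(40.0)
rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                [np.sin(theta),  np.cos(theta), 0.0],
                [0.0, 0.0, 1.0]])
shift = np.array([3.0, -7.0, 12.0])

# Same complex in a different frame: RMSD should be about 0.
same = binder_pose_rmsd(target, binder, target @ rot.T + shift, binder @ rot.T + shift)
# Binder displaced by 10 angstroms relative to the target: RMSD should be about 10.
displaced = (binder + np.array([10.0, 0.0, 0.0])) @ rot.T + shift
moved = binder_pose_rmsd(target, binder, target @ rot.T + shift, displaced)
print(round(same, 3), round(moved, 3))
```

Fitting on the target only is the key design choice: fitting on the whole complex, or on the binder, would hide exactly the displacement this layer is meant to detect.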

14 of 31 designs show strong pose agreement (binder RMSD < 5 Å), with a mean ipTMBoltz of 0.87. 12 designs show a different geometric binding mode (RMSD ≥ 15 Å) — the binder sits in a different orientation on PD-L1, despite Boltz being confident about its prediction. Five designs fall in between (5–15 Å).

[Figure 3 panels: ipTMBoltz vs binder pose RMSD (left); ipTMBoltz vs binder paratope Jaccard (right)]

Figure 3. Pose agreement between RFdiffusion and Boltz-2 predictions. Left: binder Cα RMSD after PD-L1 alignment — lower means the binder is in the same position. Right: binder paratope Jaccard overlap — higher means the same binder residues are at the interface. Points colored by hotspot configuration. ipTMBoltz correlates with both metrics (ρ = −0.54 for RMSD, ρ = +0.60 for Jaccard, both p < 0.001), but high ipTM does not guarantee pose agreement.

This is a critical result: high ipTMBoltz does not guarantee the same binding mode. Several designs with ipTM > 0.8 have binder RMSD > 20 Å — Boltz is confident about a complex, just a different one than RFdiffusion designed. In small-molecule docking, a high-confidence but wrong pose is often immediately problematic because the pose defines the interaction geometry. For protein-protein interfaces, the interpretation can be more nuanced: are the same residues mediating contact, regardless of the overall binder orientation?

Layer 3: Are the same residues at the interface?

Even when the binder approaches PD-L1 from a different angle, the functionally relevant question is whether the same binder residues engage the same target surface. We computed three residue-level overlap metrics using Cα contacts at 8 Å: binder paratope Jaccard, target epitope Jaccard, and hotspot recovery.
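The contact definition can be sketched as follows: a binder residue and a target residue are in contact when their Cα atoms lie within 8 Å, the paratope is the set of binder residues with at least one contact, and the epitope is the corresponding set on PD-L1. Counting a distance of exactly 8 Å as a contact is an assumption, and the function name, residue numbers, and coordinates are toy placeholders.

```python
import numpy as np


def interface_sets(binder_ca, target_ca, binder_ids, target_ids, cutoff=8.0):
    """Return (paratope, epitope) residue-ID sets from C-alpha coordinates."""
    # Pairwise distance matrix with shape (n_binder, n_target).
    dists = np.linalg.norm(binder_ca[:, None, :] - target_ca[None, :, :], axis=-1)
    contact = dists <= cutoff                     # assumption: cutoff is inclusive
    paratope = {binder_ids[i] for i in np.where(contact.any(axis=1))[0]}
    epitope = {target_ids[j] for j in np.where(contact.any(axis=0))[0]}
    return paratope, epitope


# Toy example: only the first binder residue sits near the first target residue.
binder_ca = np.array([[0.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
target_ca = np.array([[5.0, 0.0, 0.0], [40.0, 0.0, 0.0]])
paratope, epitope = interface_sets(binder_ca, target_ca, [1, 2], [101, 102])
print(paratope, epitope)
```

Running this once on the threaded complex and once on the Boltz-2 complex gives the two paratope sets and two epitope sets that the overlap metrics compare.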

What is a Jaccard index? The Jaccard index measures how much two sets overlap, on a scale from 0 (no overlap) to 1 (identical sets). If the threaded complex contacts PD-L1 residues {A, B, C, D} and the Boltz complex contacts {B, C, D, E}, the shared set is {B, C, D} and the combined set is {A, B, C, D, E}. The Jaccard index is 3/5 = 0.60. We apply this separately to binder-side residues (the paratope — which parts of the binder are at the interface) and target-side residues (the epitope — which parts of PD-L1 are contacted).
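The worked example above can be reproduced in a few lines of Python. Returning 0.0 when both sets are empty is an assumed convention; the page does not say how that case was handled.

```python
def jaccard(a, b) -> float:
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty contact sets
    return len(a & b) / len(union)


threaded_contacts = {"A", "B", "C", "D"}
boltz_contacts = {"B", "C", "D", "E"}
print(jaccard(threaded_contacts, boltz_contacts))  # 0.6
```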

Paratope Jaccard (convergent): 0.69
Epitope Jaccard (convergent): 0.71
Epitope Jaccard (divergent): 0.29
Hotspots contacted (either pose): 58%

For the 14 convergent designs (RMSD < 5 Å), residue-level overlap is strong: binder paratope Jaccard averages 0.69 and target epitope Jaccard averages 0.71. Both algorithms agree on which face of the binder contacts which surface of PD-L1.

For the 12 divergent designs (RMSD ≥ 15 Å), the residue overlap drops, but not to zero. Target epitope Jaccard averages 0.29, meaning that roughly 30% of the combined PD-L1 contact set (residues contacted in either structure) is shared even when the binder approaches from a different angle. And the fraction of hotspot residues contacted in either pose holds at 58% across all three RMSD buckets. The hotspot residues are being engaged; the two algorithms disagree on how the binder reaches them.

[Figure 4 panels: target epitope Jaccard vs binder pose RMSD (left); hotspot recovery vs binder pose RMSD (right)]

Figure 4. Residue-level interface agreement vs geometric pose divergence. Left: target epitope Jaccard decreases with RMSD but remains above zero even for highly divergent poses — the binder contacts partially overlapping PD-L1 residues from a different approach angle. Right: fraction of canonical hotspot residues contacted in both poses. One design (len70 cA d10r1, RMSD = 31.7 Å) recovers 75% of hotspot contacts despite a completely different binder orientation.

The standout example is len70_clusterA_noise0 design 10 (RMSD = 31.7 Å, ipTMBoltz = 0.81): the binder is in a completely different orientation but still contacts 3 of 4 canonical clusterA hotspot residues. Pose divergence does not automatically mean complete interface disagreement. Even so, pose-divergent designs should be treated as lower-confidence candidates until relaxation and MD in Part 6 show whether those contacts are physically stable.

How close do binders get to hotspot residues?

The residue-level metrics above use an 8 Å Cα–Cα cutoff to call contacts. To understand what that cutoff is doing — and where edge cases live — we plotted the distribution of minimum binder Cα → hotspot Cα distances across all 31 designs and all hotspot residues (163 design×hotspot pairs per structure source).

[Figure 5 panels: RFdiffusion/ESMFold threaded (left); Boltz-2 predicted (right)]

Figure 5. Distribution of minimum binder Cα → hotspot Cα distances for threaded complexes (left, blue) and Boltz-2 predictions (right, amber). Red dashed line marks the 8 Å contact cutoff. The threaded distribution is compressed into a narrower range (roughly 4–12 Å) while the Boltz distribution spreads across a wider range (4–22 Å). RFdiffusion was explicitly conditioned on these hotspot residues, constraining the binder close to them; Boltz-2 found the same general region from sequence alone but with more positional variance.
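The quantity plotted in Figure 5 can be sketched as below: for each hotspot residue, take the smallest Cα–Cα distance to any binder residue, then compare it with the 8 Å cutoff. The function name and coordinates are hypothetical toy values, and treating the cutoff as inclusive is an assumption.

```python
import numpy as np


def min_hotspot_distances(binder_ca: np.ndarray, hotspot_ca: np.ndarray) -> np.ndarray:
    """For each hotspot C-alpha, return the distance to the nearest binder C-alpha."""
    dists = np.linalg.norm(hotspot_ca[:, None, :] - binder_ca[None, :, :], axis=-1)
    return dists.min(axis=1)                      # one value per hotspot residue


# Toy coordinates: two binder residues and three hotspot residues.
binder_ca = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
hotspot_ca = np.array([[3.0, 0.0, 0.0], [10.0, 9.0, 0.0], [0.0, 0.0, 30.0]])

distances = min_hotspot_distances(binder_ca, hotspot_ca)
in_contact = distances <= 8.0                     # assumption: cutoff is inclusive
print(distances, in_contact)
```

In this toy case the second hotspot sits 9 Å from the nearest binder residue, just outside the cutoff. Pairs like that are the edge cases the distance distribution is meant to expose, because a small local shift would flip them from "not contacted" to "contacted".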

These are unrelaxed structures — neither has been through energy minimization. Energy minimization and short MD in Part 6 will test whether local relaxation sharpens predicted interfaces, removes bad contacts, and stabilizes residue-level interactions. If the Boltz and threaded distance distributions become more similar after relaxation, that would support the idea that some of the current disagreement reflects local geometric imprecision rather than genuine model disagreement. If they remain different, the divergence should be treated as real and factored into candidate prioritization.

The three uncertain designs

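| Design | ipTMBoltz | pLDDTBoltz | Config | Length | RMSDpose (Å) |
|---|---|---|---|---|---|
| len50 cB d19r0 | 0.413 | 0.922 | clusterB | 50 | 23.6 |
| len100 dist d18r1 | 0.314 | 0.918 | distributed | 100 | 21.9 |
| len100 dist d15r5 | 0.199 | 0.844 | distributed | 100 | 21.9 |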

All three uncertain designs also show high pose divergence and near-zero binder paratope overlap — consistent across both confidence and structural metrics. These will be deprioritized going forward.

Top 5 designs by Boltz confidence

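| # | Design | ipTMBoltz | pLDDTBoltz | RMSDpose | Paratope Jac. | Config |
|---|---|---|---|---|---|---|
| 1 | len50 cA σ0.5 d0r0 | 0.963 | 0.956 | 2.7 Å | 0.64 | clusterA |
| 2 | len70 cB σ0 d19r1 | 0.942 | 0.905 | 2.1 Å | 0.87 | clusterB |
| 3 | len100 dist σ0 d5r0 | 0.941 | 0.972 | 2.2 Å | 0.61 | distributed |
| 4 | len70 cA σ0.5 d13r1 | 0.927 | 0.960 | 2.1 Å | 0.76 | clusterA |
| 5 | len100 cA σ0 d9r8 | 0.923 | 0.964 | 2.2 Å | 0.72 | clusterA |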

All five top-confidence designs also show strong pose convergence (RMSD < 3 Å) and high paratope overlap (Jaccard > 0.6). At the top of the confidence ranking in this set, high ipTMBoltz coincides with binding-mode agreement, although Layer 2 shows that this does not hold across the full ipTM range.

What this means for the project

Boltz-2 validation answers the specific question this page set out to ask: is the binding interface encoded in the designed sequence, or only in the designed structure? The answer has three layers. First, 28 of 31 designs produce confident Boltz-2 complexes, so the sequences encode plausible binder–target interactions. Second, 14 of the 31 designs (about half) recover the same geometric binding mode as RFdiffusion. Third, even where the geometric pose diverges, the target epitope is partially conserved: both algorithms find overlapping parts of the same functional surface on PD-L1, sometimes from different approach angles.

For protein-protein interactions, this distinction between pose and interface may matter more than it would for small-molecule docking, where a wrong pose typically means a wrong drug. In our data, both algorithms engage overlapping PD-L1 surface residues even when they disagree on the binder's approach angle — suggesting that the hotspot engagement is a robust feature of the designed sequences, not an artifact of RFdiffusion's structural conditioning. Whether these alternative poses represent genuinely accessible binding modes or simply reflect the uncertainty in unrelaxed structure prediction is a question that energy minimization and MD (Part 6) can begin to address.

This result has practical consequences for the next steps. Energy minimization and MD simulation (Part 6) will run on all 28 Boltz-supported designs, followed by affinity and stability scoring (Part 7). The three uncertain designs will be included for completeness but deprioritized in the construct selection panel. Of course, if we knew that one method was definitively "better" than the other, we might make different choices when it comes to the three constructs where the two algorithms disagree. At this time, it seems wiser to move forward where there is consensus.

Caveats

Boltz-2 confidence scores are not experimental binding affinities. A high ipTMBoltz indicates that the prediction model believes the interface is real — it does not prove that the binder will bind in vitro. There is also a known lack of agreement between Boltz-2 ipTM scores and AlphaFold 3 ipTM scores for the same interactions (UCSF, 2025), meaning the absolute values should be treated as one model's opinion, not ground truth. The scores are most useful for relative ranking within a campaign, not for absolute binding prediction.