Part 1: De novo mini binder design against PD-L1

Computational design of novel mini protein binders using the RFdiffusion → ProteinMPNN → ESMFold pipeline

Tools: RFdiffusion · ProteinMPNN · ESMFold · PyMOL · Biopython · Python · Google Colab

Objective

Design de novo mini protein binders (~70 residues) targeting the PD-1 binding surface of PD-L1, using modern generative protein design tools. PD-L1 is a key immune checkpoint target in oncology — blocking its interaction with PD-1 restores T cell anti-tumor activity. This project demonstrates end-to-end computational protein design: from target structure analysis through backbone generation, sequence design, and in silico validation.

This is a portfolio project demonstrating computational protein design methodology. The in silico pass candidates are computationally promising but have not been experimentally tested — confirming function would require gene synthesis, expression, binding assays, and biophysical characterization.

Computational methods landscape

Computational protein engineering draws from three complementary methodological domains. Each captures different information about proteins — evolutionary history, physical constraints, or generative capacity — and is suited to different design questions. Knowing which tool to reach for, and why, matters more than knowing how to run any single tool.

Domain 1: Sequence-based. What has evolution explored?

Domain 2: Structure-based. What is physically feasible?

Domain 3: Generative / de novo. What can we create beyond nature?

Generative / de novo methods

Creating new proteins that don't exist in nature. Includes diffusion models (RFdiffusion — generate novel backbones from noise), inverse folding (ProteinMPNN — design sequences for a given backbone), and language model-guided generation. These methods explore regions of sequence and structure space that evolution never visited. This project sits primarily here, using RFdiffusion for backbone generation and ProteinMPNN for sequence design, with structure-based validation (ESMFold) and sequence-based analysis (per-position entropy).

When to use which approach

The strategic question is always: what do I know, and what do I need? The starting point determines the method.

Starting point	Appropriate method	Not appropriate
Target structure, no known binder	De novo generative (RFdiffusion)	Homology modeling (no template exists)
Known scaffold, need better sequence	Inverse folding (ProteinMPNN), CDR design	Full de novo (overkill: scaffold is already validated)
Large protein family, many natural sequences	Covariation analysis, MSA-based methods	De novo generative (ignores available evolutionary data)
Known protein, assessing point mutations	Language model log-likelihoods, Rosetta ΔΔG	AlphaFold alone (predicts structure, not stability change)
Candidate designs, need validation	Structure prediction (ESMFold), MD simulation, docking	ProteinMPNN score alone (necessary but not sufficient)
Known binder + co-crystal, want higher affinity	Rosetta interface design, affinity maturation	RFdiffusion (you already have the binding mode)

Project phases

Every experimental screening campaign faces the same constraint: a finite budget buys a finite number of variants, and each well, droplet, or display cycle costs real money. The cost per variant is aggressively minimized during assay development, but there is always a ceiling. The strategic question has remained the same for 30+ years: given the available slots, how do I fill them with the most promising candidates, so I'm not spending good money on dead wells?

In the absence of a good prediction, you go with randomness — error-prone PCR, NNK saturation, random mutagenesis — and let the screen do the work. This is a defensible strategy when you truly have no prior information, but it means the vast majority of your screening capacity is consumed by variants that were never going to work. The question has always been: can we do better than random? That is where AI-based approaches now sit. The tools are new — RFdiffusion, ProteinMPNN, protein language models — but the framework they serve is not. They are the current generation's answer to the same question that drove rational design, consensus sequence analysis, and structure-guided mutagenesis before them. What changes is the quality of the prediction. What stays the same is the goal: enrich the library for viable candidates so that experimental throughput is spent on sequences that have a reasonable chance of working.

This project's three-phase structure reflects that logic. Each phase filters candidates more aggressively, at increasing cost per candidate. The goal of upstream computation is to reduce the number of candidates that enter expensive downstream experimental work.

Phase	Stage	Goal	Cost	Throughput
Phase 1	In silico design	Generate candidates	Compute time	10²–10³
Phase 2	Computational validation	Filter aggressively	Compute time	10¹–10²
Phase 3	Experimental screening	Validate in the lab	$$$ per candidate	Gene synthesis → uHTS

This project covers Phase 1 (RFdiffusion → ProteinMPNN → ESMFold) with elements of Phase 2 (structural validation, per-position entropy analysis, developability considerations). Phase 3 — gene synthesis, display-based screening, biophysical characterization — is where computational designs meet experimental reality. Every candidate killed computationally in Phases 1–2 is a candidate you don't synthesize, express, and screen.

Pipeline overview — Phase 1 detail


Target preparation → RFdiffusion backbone generation → ProteinMPNN sequence design → ESMFold validation → Analysis & filtering

Stage 5: Analysis & filtering

Merge all metrics (backbone geometry, ProteinMPNN scores, ESMFold pLDDT, RMSD to design) into a unified ranking. Apply dual threshold: pLDDT > 80 (confident fold) AND RMSD < 2.0 Å (matches designed backbone). 4 of 10 candidates passed both structural filters. These are not confirmed binders — they are the subset whose designed sequences are predicted to adopt folds close to the intended backbones.
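The dual threshold can be sketched as a small filter. This is a minimal illustration, not the project's actual code: the `Candidate` fields, the `passes_filters` and `rank` helpers, and the example values are all hypothetical placeholders. Only the two cutoffs come from the text.

```python
from dataclasses import dataclass

PLDDT_MIN = 80.0  # confident fold (cutoff from the text)
RMSD_MAX = 2.0    # Å, agreement with the designed backbone (cutoff from the text)


@dataclass
class Candidate:
    name: str
    plddt: float  # mean ESMFold pLDDT, assumed to be on a 0-100 scale
    rmsd: float   # Å, predicted structure vs designed backbone


def passes_filters(c: Candidate) -> bool:
    """Both conditions must hold: pLDDT > 80 AND RMSD < 2.0 Å."""
    return c.plddt > PLDDT_MIN and c.rmsd < RMSD_MAX


def rank(candidates):
    """Passing designs first, then descending pLDDT, then ascending RMSD."""
    return sorted(candidates, key=lambda c: (not passes_filters(c), -c.plddt, c.rmsd))


# Placeholder values for illustration only, not results from this project.
demo = [
    Candidate("design_a", plddt=88.1, rmsd=1.2),
    Candidate("design_b", plddt=91.0, rmsd=3.4),  # confident but off-target fold
    Candidate("design_c", plddt=72.5, rmsd=1.1),  # close to design but low confidence
]
passing = [c.name for c in demo if passes_filters(c)]
print(passing)
```

Requiring both conditions matters: a high-pLDDT prediction can still be a confident fold into the wrong topology, and a low-RMSD prediction with poor pLDDT is not trustworthy agreement.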

3D structure viewer

PD-L1 target (PDB 4ZQK, PD-1 / PD-L1 complex) with the PD-1 binding interface highlighted.

Validation results

Backbones generated

Sequences scored

In silico pass candidates: 4 of 10

Hit rate: 40%

Candidate ranking

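Designs passing both structural filters (pLDDT > 80, RMSD < 2.0 Å) are highlighted. These are not confirmed binders; they are the subset whose designed sequences are predicted to adopt folds close to the intended backbones. The MPNN score column reflects inverse-folding confidence and is read here with higher meaning the model is more certain the sequence matches the backbone. ProteinMPNN's raw output score is an average negative log-probability, where lower is better, so check the sign convention before comparing against other reports.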

Design	pLDDT	RMSD (Å)	MPNN score	Status

Validation scatter: pLDDT vs RMSD

The quadrant plot below separates in silico pass candidates (high confidence, low deviation) from failures. Designs in the upper-left quadrant pass both structural filters.

Figure: Structure prediction confidence (pLDDT) vs design recapitulation (RMSD)

ProteinMPNN score vs structural accuracy

Does higher ProteinMPNN confidence predict better structural recapitulation? In this small sample, the correlation is weak — high MPNN scores don't guarantee low RMSD, reinforcing that structure-prediction validation is essential and MPNN score alone is insufficient for filtering.

Figure: Design score vs RMSD (color = pLDDT)

Key insight: ProteinMPNN score does not reliably predict whether a design will pass structure-prediction validation. The inverse-folding model optimizes for sequence-backbone compatibility, but ESMFold predicts structure from sequence alone and may place the same sequence in a conformation that differs from the designed backbone. This underscores the importance of the full pipeline: no single metric is sufficient.

Design decisions & scientific reasoning

Why PD-L1?

PD-L1 was chosen for four reasons: high-resolution crystal structures exist (PDB 4ZQK, 2.45 Å); the PD-1/PD-L1 binding interface is extensively characterized in the literature; the Baker lab has published de novo binder designs against this exact target using the same tools, enabling direct benchmarking against known results; and it carries immediate therapeutic relevance.

Why helical bundles?

RFdiffusion was not constrained to any particular topology. Given a 70-residue length and the flat PD-L1 binding surface, the model independently converged on three-helix bundles — consistent with published results showing that helical scaffolds are thermodynamically favorable at this size and present good complementary surfaces for flat protein-protein interfaces.

On developability

The RFdiffusion → ProteinMPNN → ESMFold pipeline optimizes for two objectives: folding and binding geometry. It says nothing about expression yield, aggregation propensity, immunogenicity, or manufacturing feasibility. In a real drug development campaign, in silico pass candidates would enter a developability assessment before any experimental work, scoring for surface hydrophobicity, charge distribution, deamidation motifs, and unpaired cysteines. The computational pipeline constrains the search space; it does not replace experimental validation.
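A first-pass, sequence-only liability scan of the kind described above might look like the sketch below. The `scan_liabilities` helper, the choice of NG and NS as deamidation-prone motifs, and the crude charge estimate are illustrative assumptions rather than an established scoring scheme. Surface hydrophobicity needs a structure and is not covered. The example sequence is made up.

```python
import re


def scan_liabilities(seq: str) -> dict:
    """Flag simple sequence-level developability liabilities (illustrative only)."""
    seq = seq.upper()
    n_cys = seq.count("C")
    return {
        # Asn followed by Gly or Ser, reported as 1-based positions of the Asn.
        # Treating NG/NS as the motif set is an assumption of this sketch.
        "deamidation_sites": [m.start() + 1 for m in re.finditer(r"N(?=[GS])", seq)],
        "cysteines": n_cys,
        # An odd count means at least one Cys cannot be paired in a disulfide.
        "odd_cysteine_count": n_cys % 2 == 1,
        # Methionines are potential oxidation sites.
        "methionines": seq.count("M"),
        # Crude net charge near neutral pH: (K + R) - (D + E); His and termini ignored.
        "approx_net_charge": seq.count("K") + seq.count("R") - seq.count("D") - seq.count("E"),
    }


example = "MKEELNGKCAERLNSDK"  # made-up sequence, not a design from this project
report = scan_liabilities(example)
for key, value in report.items():
    print(f"{key}: {value}")
```

An even cysteine count does not prove that all cysteines are paired; that still requires inspecting the structure.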

On library design

One of the most practically valuable outputs of ProteinMPNN is not the top-scoring sequence — it's the per-position probability distribution across all 20 amino acids. When you sample 8 sequences from a single backbone at low temperature (T=0.1), positions where the model is confident produce the same amino acid across all samples (low Shannon entropy). Positions where the model is uncertain sample freely across multiple identities (high entropy). This distinction maps directly onto a fundamental choice in experimental protein engineering: what to fix and what to randomize.
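The fixed-versus-tolerant split can be sketched by computing Shannon entropy per position across the sampled sequences. This is a minimal illustration, assuming equal-length sequences already parsed from the ProteinMPNN output; the `positional_entropy` and `split_positions` helpers, the zero-entropy cutoff for "fixed", and the toy four-residue samples are all hypothetical.

```python
import math
from collections import Counter


def positional_entropy(sequences):
    """Shannon entropy in bits at each position across equal-length sequences."""
    if not sequences:
        raise ValueError("no sequences given")
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("sequences must all have the same length")
    entropies = []
    for column in zip(*sequences):
        n = len(column)
        freqs = [count / n for count in Counter(column).values()]
        entropies.append(sum(-p * math.log2(p) for p in freqs))
    return entropies


def split_positions(entropies, max_fixed_entropy=0.0):
    """Return (fixed, tolerant) 1-based positions; the cutoff is an assumption."""
    fixed = [i + 1 for i, h in enumerate(entropies) if h <= max_fixed_entropy]
    tolerant = [i + 1 for i, h in enumerate(entropies) if h > max_fixed_entropy]
    return fixed, tolerant


# Eight made-up samples for one backbone (toy length of 4 residues).
samples = ["AKLE", "AKLD", "ARLE", "AKLE", "AKVE", "AKLE", "ARLE", "AKLK"]
entropy = positional_entropy(samples)
fixed, tolerant = split_positions(entropy)
print([round(h, 3) for h in entropy])
print("fixed:", fixed, "tolerant:", tolerant)
```

With only 8 samples the estimate is coarse: the entropy at any position cannot exceed log₂ 8 = 3 bits (versus log₂ 20 ≈ 4.32 bits for a uniform distribution over 20 amino acids), and a position that is identical in all 8 samples may still vary at higher sampling depth or temperature.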

In a conventional directed evolution campaign starting from scratch — using NNK codons, error-prone PCR, or random mutagenesis — every position is treated as equally uncertain. For a 70-residue mini binder, this produces a theoretical sequence space of 20⁷⁰ variants. In practice, the vast majority of this space is dead: sequences that don't fold, don't express, or don't engage the target. The chance of sampling a functional variant is vanishingly small, and most screening throughput is wasted on sequences of no interest.

An AI-informed library compresses this space by fixing the positions ProteinMPNN is confident about — the residues where backbone geometry, packing contacts, and interface interactions strongly constrain amino acid identity — and applying combinatorial diversity only at the tolerant positions. The result is a library enriched for sequences that can actually fold and bind, while remaining large enough for robust screening.

AI-informed vs random library — screening funnel

Per-position entropy analysis across the 4 in silico pass designs in this project (computed from 8 ProteinMPNN samples per backbone at T=0.1) shows that 16–34 positions are effectively fixed, while 36–54 positions are tolerant, compressing theoretical diversity from 20⁷⁰ to 20³⁶–20⁵⁴ depending on the scaffold. This is a reduction by a factor of 20¹⁶–20³⁴, or roughly 21–44 orders of magnitude (since log₁₀ 20 ≈ 1.3), with the remaining diversity concentrated in positions the model identifies as computationally permissive.
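The size of this compression is easy to check by working in log₁₀ units. The sketch below uses only the numbers quoted above (70 residues, 16–34 fixed positions, 20 amino acids); the `log10_space` helper name is hypothetical.

```python
import math

LENGTH = 70    # binder length from the text
ALPHABET = 20  # canonical amino acids


def log10_space(n_variable: int, alphabet: int = ALPHABET) -> float:
    """log10 of the number of sequences with n_variable free positions."""
    return n_variable * math.log10(alphabet)


full = log10_space(LENGTH)
reductions = {}
for n_fixed in (16, 34):
    remaining = log10_space(LENGTH - n_fixed)
    # Fixing n positions removes a factor of 20**n, i.e. n * log10(20) decades.
    reductions[n_fixed] = full - remaining
    print(f"{n_fixed} fixed: 10^{full:.1f} -> 10^{remaining:.1f} "
          f"(reduction of {reductions[n_fixed]:.1f} orders of magnitude)")
```

Each fixed position removes a factor of 20, which is about 1.3 decades, so 16–34 fixed positions correspond to roughly 21–44 orders of magnitude in base 10.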

In practice, this framework integrates AI-guided design with existing high-throughput experimental infrastructure. Fixed positions inform a defined sequence core; tolerant positions are degenerated using NNK codons or tailored degenerate codons (e.g. NDT, VHG) that enrich for the specific subset of amino acids ProteinMPNN identifies as compatible at each site. The resulting gene library feeds directly into display technologies — phage, yeast, or mRNA display — for ultra-high-throughput screening at 10⁶–10⁸ variants, followed by plate-based and biophysical screens at progressively higher resolution. The key advantage over naive randomization is not just a smaller library — it is a library where a much higher fraction of members are variants of interest rather than dead sequences sampling irrelevant space.
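To see what a given degenerate codon actually encodes, it can be expanded against the standard genetic code. The sketch below is a minimal standard-library illustration; the `expand` and `encoded` helper names are hypothetical, and the codon table is the standard code written out in TCAG order.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code, codons enumerated with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): AMINO[i] for i, codon in enumerate(product(BASES, repeat=3))
}


def expand(degenerate: str):
    """All concrete codons matched by a 3-letter degenerate codon."""
    if len(degenerate) != 3:
        raise ValueError("a codon has exactly three positions")
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate.upper()))]


def encoded(degenerate: str):
    """Return (sorted amino acids, number of stop codons, number of codons)."""
    codons = expand(degenerate)
    translated = [CODON_TABLE[c] for c in codons]
    amino_acids = "".join(sorted(set(translated) - {"*"}))
    return amino_acids, translated.count("*"), len(codons)


for scheme in ("NNK", "NDT", "VHG"):
    aas, stops, n_codons = encoded(scheme)
    print(f"{scheme}: {n_codons} codons, {len(aas)} amino acids ({aas}), {stops} stop")
```

Codon count matters as much as amino acid coverage, because DNA-level library size is the product of codon counts over the randomized positions: six NNK positions alone give 32⁶ ≈ 10⁹ variants, already above the 10⁶–10⁸ range quoted for display screening, which is the practical argument for narrower codons at most tolerant sites.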

This approach also connects naturally to developability. Positions identified as tolerant by ProteinMPNN are often solvent-exposed, meaning substitutions there are less likely to disrupt folding but may influence expression, aggregation, or immunogenicity. Rational substitutions at tolerant sites — replacing methionines to reduce oxidation liability, or introducing charged residues to improve solubility — can be incorporated into library design without sacrificing the AI-informed scaffold constraints. The result is a campaign that is simultaneously AI-guided and experimentally tractable.

Limitations

This analysis evaluates monomer fold recapitulation only — whether a designed sequence is predicted to adopt its intended backbone structure. It does not evaluate binding affinity, interface stability, complex formation with PD-L1, expression yield, solubility, aggregation propensity, specificity, or immunogenicity. ESMFold was used for binder-only structure prediction, not for binder–target complex modeling. Passing the in silico structural filters (pLDDT > 80, RMSD < 2.0 Å) means the sequence is predicted to fold as designed — it does not mean the resulting protein will bind PD-L1 or function as a therapeutic.

Experimental validation would require gene synthesis, recombinant expression, binding assays (SPR, BLI, or ELISA against PD-L1), and biophysical characterization (SEC, DLS, thermal stability). In published pipelines using similar tools, 10–30% of computationally passing designs show measurable binding in the lab. The 4 candidates identified here represent a starting point for experimental testing, not an endpoint.

Planned extensions

Molecular dynamics validation of top candidates using OpenMM (fold stability at 300 K, interface contact persistence). Disulfide bridge engineering as a discriminating test: if a marginally stable design unravels even with a covalent crosslink, the fold was never a real energy minimum. VHH scaffold-constrained design against the same target, as a second project demonstrating range across design paradigms.