Summary
In this work I investigated the suitability of diffusion language models for protein design by tackling the so-called inverse folding problem: I trained a diffusion model (DiffuSeq) to generate protein sequences that fold into a specific 3D conformation.
Some of the data analysis also became part of the ProstT5 paper. In addition, I created a poster summarizing the main findings of this thesis for the ISMB/ECCB 2023 conference; it can be found here.
Abstract of the thesis:
Generative diffusion models, like DALL-E and Imagen, have demonstrated impressive proficiency in controlled data generation, particularly for high-resolution imagery guided by natural-language text prompts. On a parallel track, AlphaFold2 has approached the threshold of resolving one of biology's most challenging tasks, the protein folding problem, by predicting the 3D structure of proteins with unprecedented accuracy. While the structural prediction of proteins is a captivating domain in its own right, the inverse problem, namely the controlled generation of sequences that fold into a predefined three-dimensional structure, constitutes an equally crucial scientific frontier across multiple fields.
This thesis serves as a proof of concept for deploying diffusion language models (dLMs) towards the goal of controlled protein design, thereby addressing the inverse folding problem. Our workflow hinged on the application of DiffuSeq, a novel, classifier-free, sequence-to-sequence text generation method introduced by Gong et al. (2022). The structural information was first encoded using the 3Di structural alphabet introduced by Foldseek.
This format reduces a protein's spatial conformation to a one-dimensional representation by projecting the three-dimensional characteristics of each amino acid onto one of 20 discrete 3Di tokens.
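To make this concrete, here is a minimal sketch of how a structure, once reduced to its 3Di string, can be paired with its amino-acid sequence as one source/target example for a sequence-to-sequence model such as DiffuSeq. The toy strings, the `src`/`trg` field names, and the helper function are illustrative assumptions, not the actual preprocessing code of the thesis:

```python
# Minimal sketch: pairing a 3Di structure string (source) with an
# amino-acid sequence (target) as one conditional training example
# for a sequence-to-sequence diffusion model such as DiffuSeq.
# In practice the 3Di string is produced by Foldseek from the 3D
# structure, one token per residue.

# The 20 discrete 3Di states, written here with amino-acid letters
# as in Foldseek's output.
THREE_DI_TOKENS = set("ACDEFGHIKLMNPQRSTVWY")

def make_pair(three_di: str, aa_seq: str) -> dict:
    """Build one structure-conditioned example: 3Di in, sequence out."""
    assert len(three_di) == len(aa_seq), "one 3Di token per residue"
    assert set(three_di) <= THREE_DI_TOKENS, "unknown 3Di token"
    return {"src": " ".join(three_di), "trg": " ".join(aa_seq)}

# Hypothetical toy example (not a real protein):
print(make_pair("DVVLCPQD", "MKTAYIAK"))
# {'src': 'D V V L C P Q D', 'trg': 'M K T A Y I A K'}
```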
In our investigation, we first examined the dependency of the 3Di alphabet on three-state secondary structure, establishing a clear prevalence (>50%) of a specific secondary structure for each 3Di letter. Thereafter, we contrasted the generative capabilities of two Transformer architectures: Bidirectional Encoder Representations from Transformers (BERT) and the Enhanced Transformer with Rotary Position Embedding (RoFormer). Next, we demonstrated the superior performance of dLMs over a semi-random baseline, which they consistently exceeded across all employed structure-similarity metrics. However, it is important to note that, despite these promising results, we fell short of the performance levels exemplified by the state-of-the-art method, ProteinMPNN. Nonetheless, given the discrete text nature of our conditional information, our findings underscore the potential of dLMs as powerful tools for generating sequences that effectively capture and leverage biological context in protein design.
![Thesis Results](images/ma-thesis/Final_values_with_errs.png)
Main results of the preliminary best model. The RMSD values in the plot are scaled; the actual values are 4.08 Å, 3.49 Å, 2.6 Å, and 2.63 Å, respectively.
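For context, the RMSD (root-mean-square deviation) reported above measures the average distance between corresponding atoms of two optimally superimposed structures. Below is a minimal sketch of how such a value can be computed for Cα coordinates, using the Kabsch algorithm for the superposition; this is an illustration assuming numpy, not the evaluation code used in the thesis:

```python
import numpy as np

def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
    """C-alpha RMSD (in Å) between two (N, 3) coordinate arrays
    after optimal superposition via the Kabsch algorithm."""
    # Center both structures on their centroids.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(U @ Vt))      # correct for reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = P @ R - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

# Toy usage: identical structures give an RMSD of ~0.
coords = np.random.rand(50, 3) * 10
print(kabsch_rmsd(coords, coords.copy()))
```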