Original title: Contrasting Sequence with Structure: Pre-training Graph Representations with PLMs
Authors: Louis Callum Butler Robinson, Timothy Atkinson, Liviu Copoiu, Patrick Bordes, Thomas Pierrot, Thomas Barrett
In this article, the authors discuss the importance of understanding protein function in fields such as drug discovery, disease diagnosis, and protein engineering. They note that Protein Language Models (PLMs) have been highly successful at learning from protein sequences, while equivalent Protein Structure Models (PSMs) have lagged behind, held back by limited structural data and the lack of suitable pre-training objectives.
To address this, the authors introduce BioCLIP, a contrastive learning framework that pre-trains PSMs by aligning their structure representations with the sequence embeddings of pre-trained PLMs. The framework yields meaningful representations of protein structures at both the residue and chain levels.
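To make the contrastive idea concrete, below is a minimal sketch of a CLIP-style symmetric InfoNCE objective between a structure encoder and a frozen PLM. This is an illustrative reconstruction, not the paper's exact loss: the encoder architectures, the temperature value, and the pooling to chain-level embeddings are all assumptions here.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(struct_emb: torch.Tensor,
                          seq_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss aligning structure and sequence embeddings.

    struct_emb: (batch, dim) chain-level embeddings from a structure encoder
                (e.g. a graph neural network over residues).
    seq_emb:    (batch, dim) chain-level embeddings from a frozen, pre-trained PLM.
    Row i of each tensor is assumed to describe the same protein chain.
    """
    # L2-normalise so the dot product below is a cosine similarity.
    struct_emb = F.normalize(struct_emb, dim=-1)
    seq_emb = F.normalize(seq_emb, dim=-1)

    # Pairwise similarity logits; matching structure/sequence pairs
    # sit on the diagonal, all other pairs act as in-batch negatives.
    logits = struct_emb @ seq_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: structure -> sequence and
    # sequence -> structure, averaged to give a symmetric objective.
    loss_s2q = F.cross_entropy(logits, targets)
    loss_q2s = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_s2q + loss_q2s)
```

Because the PLM side is held fixed, minimising this loss pushes the structure encoder to reproduce the relational geometry of the sequence embedding space, which is the sense in which the PSM is "pre-trained using PLMs".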
The authors evaluate BioCLIP on several protein tasks, including protein-protein interaction, Gene Ontology annotation, and Enzyme Commission number prediction. BioCLIP-trained PSMs consistently outperform the same models trained from scratch, improve further when their outputs are combined with sequence embeddings, and match or exceed specialized methods across all benchmarks.
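The combination of structure and sequence embeddings mentioned above could be as simple as concatenation feeding a task head. The sketch below shows one such fusion classifier; the class name, hidden size, and concatenation strategy are hypothetical choices for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Toy downstream head (e.g. for EC number prediction) that fuses
    structure and sequence embeddings by concatenation."""

    def __init__(self, struct_dim: int, seq_dim: int, num_classes: int):
        super().__init__()
        # Concatenated embedding followed by a small MLP head.
        self.head = nn.Sequential(
            nn.Linear(struct_dim + seq_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, struct_emb: torch.Tensor, seq_emb: torch.Tensor) -> torch.Tensor:
        # (batch, struct_dim) + (batch, seq_dim) -> (batch, num_classes) logits
        return self.head(torch.cat([struct_emb, seq_emb], dim=-1))
```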
Overall, this article presents a way to sidestep both the scarcity of high-quality structural data and the difficulty of designing self-supervised structural objectives, paving the way for more comprehensive models of protein function.
Original article: https://www.biorxiv.org/content/10.1101/2023.12.01.569611v1