Overview of graph models and their prospects in molecular informatics

戴嘉欣; 付冬梅; 張達威; 馬菱薇

doi:10.13374/j.issn2095-9389.2025.03.02.002

Overview of graph models and their prospects in molecular informatics

Abstract

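Molecular informatics, a frontier field at the intersection of chemistry and artificial intelligence, is rapidly driving technological innovation in areas such as drug design and functional materials development. Molecular representation learning, its core foundation, encodes molecular structures into numerical vectors that preserve their topological and physicochemical properties, providing efficient feature representations for downstream tasks such as molecular property prediction and molecular generation. Compared with rule-based and character-sequence-based representations, graph models can fully exploit the natural graph structure of molecules (atoms as nodes, chemical bonds as edges) and accurately capture molecular topology and complex interactions, and they have become the mainstream technique in the field. This paper systematically reviews the latest research progress and applications of graph models in molecular informatics. First, it traces the development of molecular representation methods in detail and explains the basic concepts and distinctive advantages of graph models. Second, focusing on the two core tasks of molecular property prediction and molecular generation, it surveys commonly used datasets, evaluation metrics, and the characteristics and research status of various graph-discriminative and graph-generative models. In addition, drawing on material property prediction and crystal generation tasks, it discusses the strengths and weaknesses, applicable scenarios, and technical challenges of different deep graph models in practical applications. Finally, it explores the potential of emerging trends such as large-scale pre-training, explainability methods, and multimodal learning in molecular informatics, and outlines future research directions. This review aims to help chemistry researchers quickly locate cutting-edge techniques and applicable methods, while mapping out technical routes for AI researchers, in order to promote more efficient algorithm design and its practical deployment in molecular informatics.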

Abstract: The rapid growth of molecular data and advances in deep learning have driven significant progress in molecular informatics. Molecular informatics is an emerging field that integrates chemistry, computational science, and artificial intelligence (AI) and employs data-driven methods to decode the relationships between molecular structures and their properties, thereby supporting drug design and material discovery. Molecular representation learning (MRL), a fundamental aspect of molecular informatics, involves encoding molecular structures and properties into numerical vectors to provide efficient representations for downstream tasks. High-quality molecular representations are critical for accurate property prediction, optimization, and generation. However, traditional rule-based MRL methods rely on handcrafted features that are time-consuming to design and expert-dependent. Sequence-based MRL methods, such as those built on the simplified molecular input line entry system (SMILES), often place bonded atoms at distant positions in the string, leading to suboptimal representations that fail to fully capture spatial and topological information. In contrast, because molecules naturally form graph structures with atoms as nodes and bonds as edges, graph-based models can use these molecular graphs directly. Aided by the strong performance of graph models in representing complex structures, learning cross-scale features, and constrained optimization, graph-based MRL methods have achieved significant advances in molecular property prediction and molecular generation. In this review, we first introduce the evolution of molecular representation methods, focusing on 2D and 3D molecular graph representations. We then classify graph models into discriminative and generative categories and discuss their concepts and applications.
Graph-discriminative models encode topological structures and node/edge features to capture nonlinear structure-property relationships for classification and regression tasks. Graph-generative models learn molecular distributions in order to optimize existing structures or design novel compounds with desired properties. Next, we review commonly used datasets, evaluation metrics, and research progress in molecular property prediction and molecular generation. Molecular property prediction estimates physical and chemical properties by analyzing internal molecular information, thereby helping researchers quickly identify suitable candidates from a large number of potential compounds. We briefly present three categories of property prediction methods (2D graph-based, 3D graph-based, and domain knowledge-integrated approaches) and introduce a recent representative method for each category. Furthermore, we review research on various graph neural network models in material property prediction tasks and their corresponding application scenarios. The goal of molecular generation is to learn latent distributions from limited datasets and to generate novel structures that satisfy specific chemical functions through sampling and decoding. We introduce widely used frameworks for molecular generation, such as variational autoencoders (VAEs), generative adversarial networks (GANs), normalizing flows, and diffusion models, which have demonstrated strong capabilities in capturing complex molecular features and optimizing chemical properties while preserving chemical validity. In addition, using crystal material generation as an example, we introduce and compare different deep generative models for material discovery, highlighting their specific application scenarios, strengths, and limitations.
Finally, we discuss future research directions for graph models in molecular informatics from the perspectives of large-scale pre-training, explainable AI, and multimodal learning strategies. This review aims to assist molecular informatics researchers in identifying cutting-edge studies and applicable methods, while clarifying the technical pathways for AI researchers to promote more efficient algorithm design and implementation.
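
As an illustration of the contrast drawn above between sequence-based and graph-based representations, the following minimal Python sketch builds a molecular graph (atoms as nodes, bonds as edges) from a toy SMILES string and applies one round of neighbor-sum aggregation. It is not taken from the review. The function names `toy_smiles_to_graph` and `neighbor_sum` are hypothetical, the parser is assumed to handle only single-letter atoms, implicit single bonds, parenthesized branches, and single-digit ring closures, and the sum aggregation is a generic simplification rather than any specific model discussed in the review.

```python
from collections import defaultdict


def toy_smiles_to_graph(smiles):
    """Parse a tiny SMILES subset into (atoms, positions, bonds).

    Assumed subset (illustration only): single-letter atoms, implicit
    single bonds, parentheses for branches, single-digit ring closures.
    """
    atoms = []         # element symbol of each atom (graph nodes)
    positions = []     # character position of each atom in the string
    bonds = []         # (i, j) atom-index pairs (graph edges)
    prev = None        # atom that the next atom will bond to
    branch_stack = []  # atoms to return to when a branch closes
    ring_open = {}     # ring-closure digit -> atom that opened it
    for pos, ch in enumerate(smiles):
        if ch.isalpha():
            idx = len(atoms)
            atoms.append(ch)
            positions.append(pos)
            if prev is not None:
                bonds.append((prev, idx))
            prev = idx
        elif ch == "(":
            branch_stack.append(prev)
        elif ch == ")":
            prev = branch_stack.pop()
        elif ch.isdigit():
            if ch in ring_open:
                bonds.append((ring_open.pop(ch), prev))
            else:
                ring_open[ch] = prev
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
    return atoms, positions, bonds


def neighbor_sum(atoms, bonds, features):
    """One aggregation round: each atom adds its neighbors' features to its own."""
    adjacency = defaultdict(list)
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)
    updated = []
    for i in range(len(atoms)):
        vec = list(features[i])
        for j in adjacency[i]:
            vec = [a + b for a, b in zip(vec, features[j])]
        updated.append(vec)
    return updated


# A six-membered carbon ring: atoms 0 and 5 are bonded (one edge apart in
# the graph) but sit at opposite ends of the SMILES string.
ring_atoms, ring_positions, ring_bonds = toy_smiles_to_graph("C1CCCCC1")
i, j = ring_bonds[-1]
print("ring-closure bond:", (i, j), "string positions:", ring_positions[i], ring_positions[j])

# A branched molecule with one-hot [is_C, is_O] node features.
atoms, _, bonds = toy_smiles_to_graph("CC(O)C")
one_hot = [[1, 0] if a == "C" else [0, 1] for a in atoms]
print("aggregated features:", neighbor_sum(atoms, bonds, one_hot))
```

In the ring example, the bond that closes the ring joins the first and last atoms of the string, which is the kind of separation the abstract attributes to sequence-based representations. In the graph, the same two atoms are direct neighbors. A practical pipeline would rely on a cheminformatics toolkit rather than this toy parser.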
