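A feature-screening pipeline based on collaboration between a large language model and machine learning models for accurate corrosion inhibitor prediction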

楊景智; 劉典典; 龔海燕; 郭鑫; 金宇婷; 馬菱薇; 張達威; 李曉剛

doi:10.13374/j.issn2095-9389.2025.10.21.001

Collaborative feature screen with large language and machine learning model to enhance corrosion inhibitor prediction

Abstract

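Abstract (Chinese version): From industrial and agricultural production to defense technology, material corrosion pervades every sector of the national economy. It seriously threatens the in-service safety of facilities and equipment, causes huge economic losses, and poses major risks and hidden hazards to human life and health. Metal corrosion inhibitors can alter the state of the metal surface and raise the activation energy barrier of the electrochemical reactions, thereby slowing the corrosion rate. Because inhibitors offer low dosage, low cost, and high efficiency, they have become one of the most widely used means of corrosion control. However, inhibitors are diverse in type, complex in mechanism, and closely tied to environmental factors. Traditional corrosion research methods, such as weight-loss tests and electrochemical tests, usually demand substantial labor, materials, and time, which greatly hinders the design and application of high-performance inhibitors. A more efficient technical approach is needed to advance inhibitor research. In recent years, the development of materials genome engineering has been steering corrosion research from empirical trial and error toward digital and intelligent methods: artificial intelligence can analyze existing data to make predictions across a vast unexplored space and to probe the latent relationships among material composition, structure, and performance. Using a feature-screening pipeline in which a large language model (LLM) and machine learning models collaborate, and drawing on systematic corrosion-knowledge injection, prompt design, and recursive screening, this work selects from 209 feature descriptors the 13 that are most relevant to inhibition performance in CO₂-saturated environments. These descriptors cover molecular physicochemical properties, molecular structural properties, and environmental parameters. After screening, the mean squared error of the model's predictions fell from 121 to 11. Subsequent corrosion experiments verified the model's predictive accuracy and generalization ability. The feature-screening workflow and machine learning model developed here significantly improve the efficiency of developing high-performance inhibitors for CO₂ environments.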

Abstract: Corrosion affects every sector of the national economy, from industrial and agricultural production to defense technology. It poses a serious threat to the safety of equipment in service, leads to substantial economic losses, and presents significant risks to human life and health. Metal corrosion inhibitors can modify the surface characteristics of metals, increase the activation energy barrier of corrosion reactions, affect surface electrochemical behavior, and slow down the corrosion rate. These inhibitors have advantages such as low dosage, low cost, and high efficiency, making them one of the most widely used methods for corrosion control. However, inhibitors come in many types, and their mechanisms are complex and closely tied to environmental factors. Conventional laboratory methods, such as precise weight-loss analysis and electrochemical measurements (for example, potentiodynamic polarization and electrochemical impedance spectroscopy), are labor-intensive, time-consuming, and costly, which greatly hinders the design and application of high-performance inhibitors. There is an urgent need for a more efficient approach to advance inhibitor research. A recent paradigm shift driven by advancements in materials genome engineering (MGE) is enabling researchers to move beyond the traditional trial-and-error approach. By integrating high-throughput computational tools with fundamental chemical principles, MGE facilitates a more systematic and intelligent exploration of materials science. At the core of this transformation lies machine learning (ML), which serves as a powerful pattern recognition engine. ML algorithms can analyze vast historical experimental data to predict the performance of novel materials and uncover the often hidden, nonlinear relationships between molecular features and their functional properties. In this study, we developed a novel methodology that synergizes a state-of-the-art large language model (LLM) with a predictive ML framework.
The LLM was employed to systematically parse and extract meaningful molecular features from thousands of unstructured research papers and experimental datasets, specifically focusing on inhibitors used in CO₂-saturated environments. We constructed a comprehensive corrosion inhibitor research dataset by extracting 1152 data entries from 174 peer-reviewed articles on inhibitor development and application in CO₂-saturated environments. These entries contain detailed information on molecular structures, corrosion environment parameters, inhibitor concentrations, experimental temperatures, and inhibition efficiency metrics. Statistical analysis revealed that the target variables in our dataset exhibited relatively uniform distributions without significant skewness or clustering, indicating a balanced data structure that supports robust model training and generalization. Our methodology implements a two-stage feature selection strategy based on a collaborative large-small model pipeline. We first established a domain-specific knowledge framework by injecting corrosion science expertise into the DeepSeek-R1 LLM, enabling systematic analysis of unstructured scientific texts. This LLM-based approach allowed us to efficiently screen an initial set of 204 molecular descriptors down to 50 candidates that demonstrate clear relevance to CO₂ corrosion inhibition mechanisms. We then applied quantitative statistical techniques using a smaller specialized model to further refine the feature set through correlation analysis and recursive feature elimination. This two-phase process reduced the final feature count to 13 non-redundant descriptors that comprehensively captured the interplay between molecular structure, inhibitor concentration, and environmental parameters. With the 13 selected features, the mean squared error of the model fell from 121 to 11.
To validate our approach, we built a gradient boosting model incorporating both the selected molecular features and environmental parameters. We identified five representative molecules and their corresponding corrosion environments for experimental testing. The results showed that the model generalizes well, confirming its potential for practical application in corrosion inhibitor design and development.
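
The second, quantitative stage described in the abstract (correlation analysis followed by recursive feature elimination) can be sketched in a few lines. This is an illustration only, not the authors' implementation: the data are synthetic, the feature names (`concentration`, `temperature`, `logP`, and so on) are hypothetical, the 0.9 correlation threshold is an assumed value, and a plain linear least-squares fit stands in for whatever small model the pipeline actually uses to rank features.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def drop_correlated(features, threshold=0.9):
    """Greedy redundancy filter: keep a feature only if its |r| with every
    already-kept feature stays below the (assumed) threshold."""
    kept = []
    for name, column in features.items():
        if all(abs(pearson(column, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


def standardize(column):
    """Rescale a column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n) or 1.0
    return [(v - mean) / std for v in column]


def fit_linear(columns, y):
    """Least-squares weights from the normal equations, solved by
    Gauss-Jordan elimination. Assumes the columns are not collinear."""
    p = len(columns)
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(u * v for u, v in zip(columns[i], columns[j])) for j in range(p)]
         + [sum(u * t for u, t in zip(columns[i], y))] for i in range(p)]
    for i in range(p):
        pivot = max(range(i, p), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(p):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [rv - f * iv for rv, iv in zip(a[r], a[i])]
    return [a[i][p] / a[i][i] for i in range(p)]


def recursive_elimination(features, y, n_keep):
    """Refit on standardized features and drop the one with the smallest
    absolute weight, until only n_keep features remain."""
    names = list(features)
    y_mean = sum(y) / len(y)
    y_centered = [t - y_mean for t in y]
    z = {name: standardize(features[name]) for name in names}
    while len(names) > n_keep:
        weights = fit_linear([z[name] for name in names], y_centered)
        weakest = min(range(len(names)), key=lambda i: abs(weights[i]))
        names.pop(weakest)
    return names


def demo():
    """Synthetic example: one redundant column and one pure-noise column."""
    rng = random.Random(0)
    n = 200
    conc = [rng.uniform(0, 100) for _ in range(n)]
    temp = [rng.uniform(20, 80) for _ in range(n)]
    logp = [rng.uniform(-2, 5) for _ in range(n)]
    features = {
        "concentration": conc,
        "temperature": temp,
        "logP": logp,
        "concentration_copy": [2 * c + 1 for c in conc],  # redundant
        "noise": [rng.gauss(0, 1) for _ in range(n)],     # irrelevant
    }
    # Made-up target standing in for inhibition efficiency.
    target = [0.5 * c - 0.3 * t + 2.0 * l + rng.gauss(0, 0.5)
              for c, t, l in zip(conc, temp, logp)]
    kept = drop_correlated(features, threshold=0.9)
    selected = recursive_elimination({k: features[k] for k in kept},
                                     target, n_keep=3)
    return kept, selected


if __name__ == "__main__":
    kept, selected = demo()
    print("after correlation filter:", kept)
    print("after recursive elimination:", selected)
```

In this toy setup the correlation filter removes the duplicated concentration column, and the elimination loop then discards the noise column because its standardized weight is the smallest. With a nonlinear learner such as gradient boosting, the ranking step would typically use the model's feature importances instead of linear weights.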
