A Quantitative Structure-Activity Relationship for Human Plasma Protein Binding: Prediction, Validation and Applicability Domain

Purpose: The purpose of this study was to develop a robust and externally predictive in silico QSAR-neural network model for predicting plasma protein binding of drugs. This model aims to enhance drug discovery processes by reducing the need for chemical synthesis and extensive laboratory testing. Methods: A dataset of 277 drugs was used to develop the QSAR-neural network model. The model was constructed using a Filter method to select 55 molecular descriptors. The validation set’s external accuracy was assessed through the predictive squared correlation coefficient Q2 and the root mean squared error (RMSE). Results: The developed QSAR-neural network model demonstrated robustness and good applicability domain. The external accuracy of the validation set was high, with a predictive squared correlation coefficient Q2 of 0.966 and a root mean squared error (RMSE) of 0.063. Comparatively, this model outperformed previously published models in the literature. Conclusion: The study successfully developed an advanced QSAR-neural network model capable of predicting plasma protein binding in human plasma for a diverse set of 277 drugs. This model’s accuracy and robustness make it a valuable tool in drug discovery, potentially reducing the need for resource-intensive chemical synthesis and laboratory testing.


Introduction
Many drugs interact with plasma proteins or other macromolecules, such as DNA, to form a drug-molecule complex. This process is called protein binding, more specifically the binding of drugs to proteins. The bound drug remains in the bloodstream, while the unbound fraction, which is the active component, can be metabolized or excreted [1]. In short, the protein-binding process is defined as the formation of complexes through hydrogen bonding, hydrophilic bonding, ionic bonding, van der Waals bonding, and covalent bonding.
The binding of drugs to proteins can be reversible or irreversible [2,3]. Irreversible drug-protein binding results from chemical activation of a drug, which then binds tightly to a protein or macromolecule through a covalent chemical bond. Irreversible drug binding is responsible for some types of drug toxicity that can occur over a long period of time [4]. Reversible drug-protein binding means that the drug binds through weaker chemical bonds, such as hydrogen bonds or van der Waals forces. At low drug concentrations, most of the drug is bound to the protein, whereas at high drug concentrations the protein binding sites become saturated, leading to a rapid increase in the free drug concentration. Plasma protein binding therefore plays a key role in drug therapy: it affects the pharmacokinetics and pharmacodynamics of the drug because it is often directly related to the concentration of free drug in plasma [5,6].
The construction of in silico models that establish a mathematical relationship between molecular structure and the properties of interest is an important step in drug discovery, as it avoids chemical synthesis and reduces expensive and lengthy laboratory tests [7,8]. In recent years, several QSAR models have been developed to predict plasma protein binding using powerful prediction algorithms, such as support vector machines and their derivatives [9-11], random forest [12], neural networks [13,14], and gradient boosting decision trees [15]. In 2017, Sun et al constructed QSAR models using six machine-learning algorithms with 26 molecular descriptors [16]. In 2018, Kumar et al presented a systematic approach using support vector machine, artificial neural network, K-nearest neighbor, probabilistic neural network, partial least squares, and linear discriminant analysis for a diverse dataset of 735 remedies [17]. In 2020, Yuan et al published a global quantitative structure-activity relationship (QSAR) model for plasma protein binding and developed a novel strategy to construct a robust QSAR model for its prediction [18]. Altae-Tran et al introduced deep-learning healthcare techniques that successfully predicted drug activity and structure [19]. Wallach and co-authors introduced AtomNet, known as the first structure-based deep convolutional neural network, to predict small-molecule bioactivity for drug discovery applications [20].
This work uses a systematic methodology based on QSAR, a Filter method, and a feed-forward neural network (FFNN) to predict plasma protein binding for 277 molecules. The Filter method, known as the most popular feature selection technique, was used to reduce the number of descriptors. A feed-forward neural network was then used to predict plasma protein binding from the selected descriptors.

Materials and Methods
A five-step process was employed to predict the plasma protein-binding, as shown in Figure 1: (1) data set collection, (2) molecular descriptors generation, (3) selection of relevant descriptors by a filter method, (4) FFNN modeling, (5) validation of models.

Data set collection
The experimental protein-binding values of the 277 drugs used in this study were taken from the Pharmacological Basis of Therapeutics handbook [21] and the Handbook of Clinical Drug Data [22]. Chemical names and experimental protein-binding values are presented in Supplementary file 1. The dataset was divided into two parts: the first, with 235 plasma protein-binding values, was used to develop the QSAR model; the second, with the remaining 42 values, was reserved for external validation. The data were partitioned using holdout cross-validation.
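The holdout partition described above can be illustrated with a minimal sketch. The source does not say how the 42 validation compounds were chosen, so the random shuffle, the seed, and the function name `holdout_split` are assumptions made only for illustration.

```python
import numpy as np

def holdout_split(n_samples, n_validation, seed=0):
    """Return (train_idx, valid_idx) for one random holdout partition.

    Assumption: compounds are assigned at random; the paper does not
    state the selection rule it used.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    valid_idx = np.sort(shuffled[:n_validation])
    train_idx = np.sort(shuffled[n_validation:])
    return train_idx, valid_idx

# 277 drugs in total, 42 held out for external validation.
train_idx, valid_idx = holdout_split(n_samples=277, n_validation=42)
print(len(train_idx), len(valid_idx))  # prints "235 42"
```

The two index arrays are disjoint and together cover all 277 compounds, so the validation set is never seen during model development.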

Molecular descriptors generation
The molecular structure was represented numerically in terms of molecular descriptors. The SMILES (simplified molecular input line-entry system) strings required to calculate the descriptors were extracted from the open-access database PubChem [23]. SMILES is a standard line notation for specifying the structure of chemical species [24]. Table 1 lists the 1666 descriptors, sorted into twenty categories, that were calculated from the SMILES strings of the 277 drugs. All descriptors were obtained with the E-Dragon online program [25], the electronic remote version of the well-known software DRAGON created by the Milano Chemometrics and QSAR Research Group of Prof. R. Todeschini. The names and numbers of the calculated descriptors are presented in Supplementary file 2.

Selection of relevant descriptors
Selecting relevant descriptors [27][28] also reduces the risk of overfitting and overtraining [29]. Feature selection methods are widely available in the literature. The characteristics, advantages, and disadvantages of the three main strategies that can be used for the selection of relevant descriptors are reported in Table 2 [30]. The following procedure was used to reduce the number of molecular descriptors [31]:
1. Descriptors having constant values (min = max) were eliminated.
2. Quasi-constant descriptors (1st quartile, 25% = 3rd quartile, 75%) were removed.
3. Descriptors with a relative standard deviation RSD < 0.05 were deleted. The three steps above were performed using STATISTICA software [32].
4. Matrices of the pairwise linear correlation between each pair of columns in the input matrices were calculated via MATLAB [33]. Every variable with a correlation coefficient R > 0.75 was removed.
For more robustness of the model, the variance inflation factor (VIF) was calculated as follows:
$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$
where R_i^2 is the squared correlation coefficient between the ith descriptor and the others. All descriptors with VIF > 5 were eliminated from the model [34].
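The filtering steps above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' STATISTICA/MATLAB workflow: the function names, the greedy rule that keeps the first descriptor of each highly correlated pair, and the iterative removal of the descriptor with the largest VIF are all assumptions, since the text does not specify them.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R_i^2)."""
    n, p = X.shape
    out = np.empty(p)
    for i in range(p):
        y = X[:, i]
        # Regress column i on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[i] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def filter_descriptors(X, rsd_min=0.05, r_max=0.75, vif_max=5.0):
    """Return the column indices of X that survive the filter steps."""
    keep = np.arange(X.shape[1])
    # Step 1: drop constant columns (min == max).
    keep = keep[X[:, keep].min(axis=0) != X[:, keep].max(axis=0)]
    # Step 2: drop quasi-constant columns (25% quartile == 75% quartile).
    q1, q3 = np.percentile(X[:, keep], [25, 75], axis=0)
    keep = keep[q1 != q3]
    # Step 3: drop columns whose relative standard deviation is too small.
    mean = X[:, keep].mean(axis=0)
    std = X[:, keep].std(axis=0, ddof=1)
    keep = keep[std / np.maximum(np.abs(mean), 1e-12) >= rsd_min]
    # Step 4 (assumed greedy rule): keep a column only if its |R| with
    # every previously kept column is at most r_max.
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    selected = []
    for j in range(len(keep)):
        if all(corr[j, s] <= r_max for s in selected):
            selected.append(j)
    keep = keep[selected]
    # Step 5 (assumed iterative rule): drop the largest VIF until all <= vif_max.
    while len(keep) > 1:
        vifs = vif(X[:, keep])
        worst = int(np.argmax(vifs))
        if vifs[worst] <= vif_max:
            break
        keep = np.delete(keep, worst)
    return keep

# Synthetic demonstration data (not the paper's descriptors).
rng = np.random.default_rng(0)
n = 200
a = rng.normal(10.0, 2.0, n)
b = rng.normal(5.0, 1.0, n)
X_demo = np.column_stack([
    a,                                    # informative
    np.full(n, 3.0),                      # constant
    b,                                    # informative
    2.0 * a + rng.normal(0.0, 0.01, n),   # nearly collinear with column 0
    100.0 + rng.normal(0.0, 0.1, n),      # RSD far below 0.05
    (np.arange(n) < 10).astype(float),    # quasi-constant
])
print(filter_descriptors(X_demo))
```

On the synthetic matrix only the two independent, informative columns survive. On real descriptor data the result depends on column order because of the greedy correlation rule, which is one reason that rule is flagged as an assumption.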

Model development
To predict plasma protein binding, the selected descriptors were used as inputs to the FFNN. Different approaches exist for determining the number of hidden neurons required for a modeling task; they are explained in detail in a review of methods for selecting the number of hidden nodes in artificial neural networks [35]. In this work, the following steps were used to choose the number of neurons in the hidden layer [36]:
1. Initially, only five hidden neurons are used.
2. The FFNN is trained until the mean squared error no longer improves.
3. At this point, five neurons are added to the hidden layer, each with randomly initialized weights, and training is resumed.
4. Steps 2 and 3 are repeated until a termination criterion is satisfied.
The mathematical equation of the model used for the prediction of protein binding is:
$$f_b = b + \sum_{j=1}^{k} w_{2j}\,\tanh\left(b_j + \sum_{i=1}^{p} w_{ij}\,x_i\right)$$
where x_i is the input that corresponds to the number of data included in the training of the ANN (i from 1 to 15), w_ij (i = 1…p, j = 1…k) are the weights from the input to the hidden layer, b_j (j = 1…k) are the biases of the neurons in the hidden layer, k = 40 for the filter method, w_2j (j = 1…k) are the weights from the hidden to the output layer, b is the bias of the output neuron, and f_b is the output.
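As an illustration of the model form described above, the following sketch evaluates a one-hidden-layer feed-forward network. It assumes a hyperbolic tangent hidden layer and an identity output, with p = 55 inputs and k = 40 hidden neurons as reported for the final model. The weights are random placeholders, not the trained values, and `ffnn_predict` is a hypothetical name.

```python
import numpy as np

def ffnn_predict(X, W1, b1, w2, b2):
    """Forward pass of a p-k-1 network.

    X: (n, p) inputs; W1: (p, k) input-to-hidden weights; b1: (k,) hidden
    biases; w2: (k,) hidden-to-output weights; b2: scalar output bias.
    Returns an array of n predictions.
    """
    hidden = np.tanh(X @ W1 + b1)  # tanh transfer function in the hidden layer
    return hidden @ w2 + b2        # identity transfer function at the output

# Placeholder weights for a 55-40-1 topology (assumed random, untrained).
rng = np.random.default_rng(0)
p, k = 55, 40
W1 = rng.normal(scale=0.1, size=(p, k))
b1 = rng.normal(scale=0.1, size=k)
w2 = rng.normal(scale=0.1, size=k)
b2 = 0.5
X_demo = rng.normal(size=(3, p))
y_demo = ffnn_predict(X_demo, W1, b1, w2, b2)
print(y_demo.shape)  # prints "(3,)"
```

The matrix form computes, for every sample at once, the same double sum over inputs and hidden neurons that defines the network output.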

Model validation
We established internal and external validation criteria to assess the generalizability and predictive power of the QSAR models. The following statistical parameters were used in our investigation to evaluate the models' performance: the mean squared error (MSE), the correlation coefficient (R), the predictive squared correlation coefficient (Q2), and the coefficient of determination (R2).

Table 2. Feature selection methods (characteristics by method)
Filter methods: Use feature relevance score to select the top rank features.
Wrapper methods: Select a subset of relevant features using a learning algorithm. Conduct search in the space of possible parameters.
Embedded methods: Include the classifier construction for the optimal feature selection. Like wrapper approaches, these methods are specific to a given learning algorithm.

The residual sum of squares (RSS) is the sum of squared differences between the fitted values and the observed values. The sum of squares (SS) is the sum of squared differences between the observations and their mean. The predictive residual sum of squares (PRESS) is the sum of squared differences between the predictions and the observations.
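The validation statistics can be computed from PRESS and SS in a few lines. This sketch assumes the common definitions Q2 = 1 - PRESS/SS and RMSE = sqrt(PRESS/n). The source's own equations did not survive extraction, so the choice of reference mean (here the mean of the supplied observations unless another is passed) is an assumption, as are the function names and the example values.

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root mean squared error between observations and predictions."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))

def q2(y_obs, y_pred, reference_mean=None):
    """Predictive squared correlation coefficient, 1 - PRESS / SS.

    reference_mean: mean used in SS; defaults to the mean of y_obs
    (assumption: some authors use the training-set mean instead).
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean = y_obs.mean() if reference_mean is None else reference_mean
    press = np.sum((y_obs - y_pred) ** 2)  # predictive residual sum of squares
    ss = np.sum((y_obs - mean) ** 2)       # sum of squares about the mean
    return float(1.0 - press / ss)

# Made-up fractions bound, for illustration only.
y_obs = [0.10, 0.50, 0.90, 0.99]
y_pred = [0.12, 0.48, 0.93, 0.97]
print(q2(y_obs, y_pred), rmse(y_obs, y_pred))
```

A perfect prediction gives Q2 = 1 and RMSE = 0, while always predicting the mean of the observations gives Q2 = 0, which is why values close to 1 indicate good external predictivity.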

Results and Discussion
The selection of the most important descriptors using the correlation coefficient R and the variance inflation factor VIF showed that 55 descriptors were the most appropriate. The VIF values calculated for the selected descriptors are all less than five, indicating that multicollinearity between the selected descriptors is acceptable. To give an overview of the correlation structure, we used a heatmap (Figure 2). Table 3 shows the VIF values for the selected descriptors and their meanings.
We followed the above-mentioned procedure to determine the required number of hidden neurons. The accuracy of each model was assessed using the R(all), MSE(validation), R2(train), and Q2 criteria. The best model was chosen based on the maximum R(all), R2(train), and Q2 and the lowest MSE(validation) [31,37]. Table 4 shows the 10 network models developed. The results show that network eight, with 40 neurons, is the best model, with R(all) = 0.990, R2(train) = 0.981, Q2 = 0.989, and MSE(validation) = 0.002. The best-performing model had a topology of 55-40-1: 55 input nodes, one hidden layer with 40 nodes using the hyperbolic tangent as transfer function, and one output layer with an identity function.
The neural networks were implemented using the Neural Network Toolbox for MATLAB [33]. The network type used is a feed-forward network with the Levenberg-Marquardt backpropagation training function and the gradient descent with momentum weight and bias learning function; the data were partitioned using holdout cross-validation. Figure 3 shows the predicted protein-binding values versus the experimental ones for the training and validation sets. The results show a close correlation between predicted and observed plasma protein binding. The difference between R2(train) and Q2 was equal to 0.008. This difference did not exceed 0.3, indicating the robustness of the model [38].
To investigate the predictability and performance of the model developed in this work, a statistical evaluation was carried out, as shown in Table 5. External validation parameters were also used to evaluate the model's quality. The model stands out for its high predictive power, with a Q2 value greater than 0.9 [38].
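As a short worked check of the robustness criterion, using only the values reported above for network eight and the 0.3 threshold cited in the text:

$$\left| R^2_{\mathrm{train}} - Q^2 \right| = \left| 0.981 - 0.989 \right| = 0.008 < 0.3$$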

Comparison between models from literature
We compared the few models reported in the literature with our model for predicting the binding of drugs to plasma proteins (Table 6). Evaluating the advantages and disadvantages of these methods is difficult, because each study used different data sets and different modeling approaches. Nevertheless, the statistical parameters of our study exceed those of the models reported in the literature.

Applicability domain
A clearly defined applicability domain is one of the principles recommended in the OECD guidelines [41]. In this work, we analyzed the applicability domain with the different approaches reported in Table 7, together with their results. The algorithms and methods of these approaches can be found in the literature [42,43]. The number of samples inside the applicability domain varied depending on the method used. Euclidean distance (95th percentile) and classical KNN (Euclidean distance, k = 5) identified two test samples outside the applicability domain. KNN (Euclidean distance, k = 25) identified one test sample outside the applicability domain. The bounding box considered three test samples outside the applicability domain, as shown in Figure 4. Although these points are far from the rest of the observations, they lie close to the fitted regression line because their residuals are small; they are therefore good leverage points. These results show that the model can be used to predict plasma protein binding for new compounds that have not been tested.
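Three of the applicability-domain checks named above can be sketched as follows. This is a simplified illustration rather than the algorithms of the cited references: the function names are hypothetical, the kNN threshold (95th percentile of the training compounds' mean distance to their k nearest neighbours) is an assumption, and the descriptors are assumed to be scaled to comparable ranges beforehand.

```python
import numpy as np

def bounding_box_inside(X_train, X_test):
    """True where every descriptor lies within the training min-max range."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.all((X_test >= lo) & (X_test <= hi), axis=1)

def centroid_distance_inside(X_train, X_test, percentile=95):
    """True where the Euclidean distance to the training centroid does not
    exceed the given percentile of the training distances."""
    centroid = X_train.mean(axis=0)
    d_train = np.linalg.norm(X_train - centroid, axis=1)
    threshold = np.percentile(d_train, percentile)
    return np.linalg.norm(X_test - centroid, axis=1) <= threshold

def knn_inside(X_train, X_test, k=5, percentile=95):
    """True where the mean distance to the k nearest training compounds
    does not exceed a percentile of the same score on the training set."""
    d_tt = np.linalg.norm(X_train[:, None, :] - X_train[None, :, :], axis=2)
    d_tt.sort(axis=1)                              # column 0 is the self-distance
    train_score = d_tt[:, 1:k + 1].mean(axis=1)
    threshold = np.percentile(train_score, percentile)
    d_st = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    d_st.sort(axis=1)
    return d_st[:, :k].mean(axis=1) <= threshold

# Synthetic demonstration: 235 training points, one central and one remote query.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(235, 3))
X_test = np.array([[0.0, 0.0, 0.0], [100.0, 100.0, 100.0]])
for check in (bounding_box_inside, centroid_distance_inside, knn_inside):
    print(check.__name__, check(X_train, X_test))
```

The methods can disagree on borderline compounds, which is consistent with the differing counts of out-of-domain test samples reported above.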

Conclusion
In this study, we constructed a QSAR model to predict human plasma protein binding for 277 drugs. Feature selection by a Filter method produced 55 inputs, which were used to train an FFNN for prediction. Examination of the external and internal criteria indicated that the QSAR model developed is robust, externally predictive, and distinguished by a good applicability domain. The external accuracy of the validation set was assessed by Q2 and RMSE, which are equal to 0.966 and 0.063, respectively. 98.30% of the external validation set is correctly predicted. In accordance with the OECD principles, this QSAR model can be used to predict the fraction of human plasma protein binding for drugs that have not been tested, avoiding chemical synthesis and reducing expensive laboratory tests.

Figure 1. Flow sheet of the procedure followed

Figure 2. Heatmap of the correlation matrix for Filter method

Table 2. Feature selection methods and their advantages and disadvantages

Feature selection with filter methods | Feature selection with wrapper methods | Feature selection with embedded methods
Filter methods: Relevance of the features is calculated by considering the intrinsic properties of the data.

Table 5. The model's robustness is demonstrated by the fact that the internal validation's statistical coefficients are all acceptable and satisfactory (lowest MSE, RMSE, and MAE, as well as high 2 ,

Table 3. The VIF values for the selected descriptors by filter method

Table 4. Selected criteria of the different multi-layer perceptrons for Filter method

Table 5. External and internal criteria of the model