A QSAR Study on the 4-Substituted Coumarins as Potent Tubulin Polymerization Inhibitors

Purpose: Despite the discovery and synthesis of several anticancer drugs, cancer remains a major life-threatening disease, second only to cardiovascular diseases. Toxicity, severe side effects, and drug resistance are serious problems of the commercially available anticancer drugs. Coumarins are synthetic and natural heterocycles that show promising antiproliferative activities against various tumors. The aim of this research was to study coumarin derivatives computationally in order to develop reliable quantitative structure-activity relationship (QSAR) models for predicting their anticancer activities. Methods: A data set of thirty-one coumarin analogs with significant antiproliferative activities toward HepG2 cells was selected from the literature. The molecular descriptors for these compounds were calculated using the Dragon, HyperChem, and ACD/Labs programs. A genetic algorithm (GA) combined with multiple linear regression (MLR) was employed for simultaneous feature selection and model development. Results: The developed linear QSAR models with three and four descriptors showed good predictive power, with r² values of 0.670 and 0.692, respectively. Moreover, the calculated validation parameters confirmed the reliability of the QSAR models. Conclusion: The findings of the current study could be useful for the design and synthesis of novel anticancer drugs based on the coumarin structure.


Introduction
Cancer is one of the most severe life-threatening human health problems worldwide. 1,2 Despite significant developments in cancer chemotherapy over the past 50 years, cancer continues to be the second most frequent cause of death after cardiovascular diseases. 2,3 There are numerous reports in the literature on the discovery of novel anticancer agents, but there is no single drug with 100% efficacy for cancer treatment. 4 Most of the clinically used drugs have limited effectiveness and selectivity, accompanied by serious toxicity and unacceptable side effects. 5,6 Moreover, the most common tumors show resistance against a significant number of commercially available anticancer drugs. 6 Therefore, there is still considerable demand for the discovery of efficient new anticancer drugs to overcome the current chemotherapeutic problems in cancer treatment. 5

Coumarin and its derivatives are important oxygen-containing heterocycles that are found in natural products. Over the past decades, coumarins have attracted great attention because of their interesting biological and pharmacological activities, such as anticoagulant, 7 anti-inflammatory, 8,9 antioxidant, 10 antiviral, 11 antimicrobial, 12,13 antidepressant, 14 and anti-HIV effects. [15][16][17] They are also promising compounds because of their low toxicity, low drug resistance, fewer side effects, high bioavailability, and ease of chemical synthesis. 18 Several studies have shown that coumarins are potential anticancer agents with growth suppressive effects on many types of cancer, such as ovarian, 19 breast, 20,21 skin, 22 prostate, 23 liver, 24,25 and pancreatic. 26,27

Computer-assisted drug design (CADD) has attracted considerable attention in modern drug discovery and development because it reduces the time-consuming and expensive synthetic and biological experiments needed to achieve the required results. 28 Quantitative structure-activity relationship (QSAR) studies, as part of CADD techniques, play a critical role in medicinal chemistry for the design of new therapeutically active compounds. [29][30][31] QSAR studies are used for the prediction of biological activity and may also be used for interpreting the mode of ligand-receptor interaction. The time and cost required for drug design and discovery are significantly decreased by using various QSAR techniques. 32 In the current work, a QSAR analysis was conducted on a set of coumarin analogs for which biological activities have been reported in the literature. 33 Using GA-MLR-based two-dimensional QSAR analysis, the cell toxicity of the studied coumarins was correlated with their structural features. Based on the obtained results, the developed linear models showed good predictive power and can be used in designing new anticancer agents.

Methodology

Data set
The experimental IC50 (nM) values for the antiproliferative activities of 31 coumarin derivatives against the HepG2 cell line, reported by Cao et al., 33 were used in the present study. For the QSAR analysis, all biological data were converted into pIC50 (i.e., -log IC50).
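The conversion described above can be sketched in a few lines of Python. This is only an illustration: the text gives the IC50 values in nM and defines pIC50 as -log IC50, and the sketch assumes that the base-10 logarithm is taken of the molar concentration, which is the common convention but is not stated explicitly here. The function name and the example values are hypothetical and are not taken from the data set.

```python
import math

def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in mol/L)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be positive")
    # Assumption: the logarithm is applied to the molar concentration.
    ic50_molar = ic50_nm * 1e-9
    return -math.log10(ic50_molar)

# Hypothetical IC50 values in nM (not from the studied compounds)
for ic50 in (1.0, 10.0, 250.0):
    print(f"IC50 = {ic50:7.1f} nM -> pIC50 = {pic50_from_ic50_nm(ic50):.2f}")
```

Under this convention a tenfold decrease in IC50 raises pIC50 by one unit, so more potent compounds have larger pIC50 values.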

Molecular descriptors
The 3D structures of the ligands were built with GaussView 5.0 software. 34 Energy minimization of the structures was carried out first with the empirical method (i.e., MM+) 35 and then with the semi-empirical AM1 method, 36 using the Polak-Ribiere algorithm included in HyperChem 7.5 software. 37 The molecular descriptors for the fully optimized molecular structures were calculated using the Dragon (version 3.0) program. 38 Log P and log D were calculated with the ACD/Labs 6.0 program, 39 while the molar refractivity, surface area, density, and polarizability were calculated using HyperChem 7.5 software. From the total set of molecular descriptors calculated by Dragon, descriptors with 50% constant values were omitted. Moreover, descriptors were pretreated to remove those with pairwise correlations greater than 0.95. 40 These pretreatments were performed using R 3.2.3 software. 41

Methods
Three algorithms were used for dividing the data set into train and test sets: the Kennard-Stone, Euclidean Distance, and Activity/Property methods, which are available in a Java-based tool. 42,43 To reduce the number of molecular descriptors and to select the appropriate features, multiple linear regression (MLR) optimized by a genetic algorithm (GA), known as GA-MLR, was used. This tool has a Java-based graphical user interface and proposes an MLR model based on five validation parameters, i.e., r², r²(adjusted), q², r²m, and Δr²m, with their default thresholds set to > 0.6, > 0.6, > 0.6, > 0.5, and < 0.2, respectively. 44 The GA-MLR approach was run with its default settings to find linear equations with three and four parameters.
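The descriptor pretreatment described above (removal of largely constant descriptors and of highly inter-correlated descriptors) can be illustrated with a short Python sketch. It is not the R script used in this study. It assumes that "50% constant values" means that a single value accounts for at least half of the compounds, and it uses a simple greedy rule that keeps the first descriptor of each correlated pair. The function names and the synthetic descriptor matrix are hypothetical.

```python
import numpy as np

def drop_near_constant(X, names, max_same_fraction=0.5):
    """Drop columns in which one value covers at least max_same_fraction of the rows."""
    keep = []
    for j in range(X.shape[1]):
        _, counts = np.unique(X[:, j], return_counts=True)
        if counts.max() / X.shape[0] < max_same_fraction:
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

def drop_correlated(X, names, threshold=0.95):
    """Greedy filter: keep a column only if |r| with every kept column is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return X[:, keep], [names[j] for j in keep]

# Synthetic descriptor matrix for 31 hypothetical compounds
rng = np.random.default_rng(0)
n = 31
a = rng.normal(size=n)
b = rng.normal(size=n)
X = np.column_stack([
    a,              # informative descriptor
    2.0 * a + 1.0,  # linear copy of the first column (|r| = 1)
    b,              # independent descriptor
    np.zeros(n),    # constant descriptor
])
names = ["d1", "d1_scaled", "d2", "const"]

X1, names1 = drop_near_constant(X, names)
X2, names2 = drop_correlated(X1, names1)
print(names2)
```

The constant filter is applied first so that the correlation matrix is not computed on zero-variance columns, for which the correlation coefficient is undefined.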
Although GA-MLR was applied only to the train set, the generated models were validated on the test set compounds using four criteria, i.e., Q²(test), absolute percentage error (APE), mean absolute percentage error (MAPE), and standard deviation of error of prediction (SDEP), calculated according to equations 1, 2, 3, and 4:

$$Q^2_{(test)} = 1 - \frac{\sum_{i=1}^{N} (y_{obs,i} - y_{pred,i})^2}{\sum_{i=1}^{N} (y_{obs,i} - y_m)^2} \tag{1}$$

$$APE = \frac{\left| pIC_{50(obs)} - pIC_{50(pred)} \right|}{pIC_{50(obs)}} \times 100 \tag{2}$$

$$MAPE = \frac{1}{N} \sum_{i=1}^{N} APE_i \tag{3}$$

$$SDEP = \sqrt{\frac{PRESS}{N}} \tag{4}$$

where y_obs,i (pIC50(obs)) and y_pred,i (pIC50(pred)) are the experimental and predicted activities of an individual compound in the test set, respectively, N is the number of molecules, and y_m is the mean of the experimental biological activities of the compounds. PRESS is the predictive residual sum of squares. The fitness function (F) applied in this approach is described in the manual of the GA-MLR tool, to which readers are referred for more details.

Results and Discussion
The structures of the coumarin analogs used in the current study are shown in Table 1. The substituents differ in size, lipophilicity, and electronic features. Computing a wide range of structural descriptors to extract chemical information from the data set compounds is essential for any successful QSAR analysis. In the various fields of chemometrics, it is well established that an effective variable selection method, which reduces the complexity of the model, can improve the interpretability and predictive ability of the developed model. [45][46][47] To develop a QSAR model that explains the antiproliferative activities of the compounds shown in Table 1 on HepG2 cells, a large number of structural parameters belonging to different classes of descriptors, such as those listed in Table 2, were used.
The total set of compounds was randomly divided into train (21 compounds, 70% of the whole data set) and test (10 compounds, 30% of the whole data set) sets for generating the QSAR models and validating them, respectively. For this purpose, the hybrid methodology developed in Roy's lab, known as GA-MLR, was applied to three train sets selected by the three data division algorithms mentioned in the Methods section. The model building process was run thirty times for each set to generate three-parameter models, which led to a total of 90 different models with three parameters. The r² values of all these models were compared, and the Euclidean Distance method was identified as the best data division method. Then, to achieve better results on this set, twenty further runs were performed using the same settings. Moreover, to generate four-parameter equations, the best three-parameter equation was extended by adding a fourth parameter, one at a time, from the whole pool of descriptors (i.e., in an all-walk manner) to identify the best four-parameter model. Equations 6 and 7 are the best three- and four-parameter models, respectively.
In Equations 6 and 7, N, r², Q²(test), and MAPE are the number of compounds, the squared correlation coefficient of the train set, the squared correlation coefficient of the test set, and the mean absolute percentage error, respectively. Table 2 shows the statistical parameters of the two developed models in more detail. The numerical values and detailed information about the selected descriptors are listed in Tables 2 and 3. The correlation matrix of the selected descriptors is presented in Table 4. The three-parameter model (Eq. 6) predicts the antiproliferative activities of the studied coumarins using the RDF030u, LP1, and EEig02x descriptors. RDF030u (radial distribution function 3.0/unweighted) belongs to the group of Radial Distribution Function descriptors, which are obtained from radial basis functions centered on different interatomic distances ranging from 0.5 to 15.5 Å. 48 The radial distribution function of a system of particles (atoms, molecules, colloids, etc.) describes how density varies as a function of distance from a reference particle. When studying the chemical properties of a compound, the probability distribution of atoms in a spherical volume with a radius of 3.0 Å is regarded as an important factor. 49,50 The LP1 feature, which belongs to the topological descriptors, is one of the 2D matrix-based descriptors and is calculated from the eigenvalues of a square (usually symmetric) matrix representing the molecular graph. 51 Du and colleagues have reported that small and large values of LP1 are indicative of compounds with fewer and more branches, respectively. 52 In other words, LP1 is a molecular branching index. The negative coefficient of LP1 in both developed equations indicates that pIC50 is inversely related to this descriptor, which suggests that 4-substituted coumarins with fewer branches in the overall structure may show higher antiproliferative activity.
The next feature (i.e., EEig02x), which also belongs to the topological descriptors, is derived from the edge adjacency matrix weighted by edge degrees. 53 This descriptor is associated with molecular polarity and describes the electronic effects as well as the hydrophobic properties of the molecule. 54 The second QSAR model (Eq. 7) describes the activities of the coumarin analogs using one extra parameter added to the three features explained above. The new parameter, Mor04p, belongs to the 3D-MoRSE group of descriptors and is calculated by incorporating polarizability-based weighting of the scattering features of the molecules. 55 The presence of the Mor04p descriptor in the developed model can be regarded as evidence for the influence of the 3D arrangement of the molecules, as extracted from electron diffraction studies, 56 on the antiproliferative activities of the studied compounds. The increase in pIC50 correlates directly with the shape and size of the studied 4-substituted coumarin derivatives.
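To make the two geometry-based descriptor families discussed above more concrete, the following Python sketch evaluates a radial distribution function code and a 3D-MoRSE-type scattering sum from atomic coordinates. It uses the general textbook forms of these descriptors and is not the Dragon implementation. The smoothing parameter beta, the scaling factor, the scattering parameter s, the function names, and the three-atom geometry are all assumptions chosen for illustration.

```python
import numpy as np

def _pair_distances(coords):
    """Return index arrays (i, j) for all atom pairs i < j and their distances."""
    coords = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(coords), k=1)
    return i, j, np.linalg.norm(coords[i] - coords[j], axis=1)

def rdf_code(coords, radius, weights=None, beta=100.0, scale=1.0):
    """RDF(R) = scale * sum_{i<j} w_i * w_j * exp(-beta * (R - r_ij)^2)."""
    i, j, r_ij = _pair_distances(coords)
    w = np.ones(len(coords)) if weights is None else np.asarray(weights, dtype=float)
    return scale * np.sum(w[i] * w[j] * np.exp(-beta * (radius - r_ij) ** 2))

def morse_code(coords, s, weights=None):
    """I(s) = sum_{i<j} w_i * w_j * sin(s * r_ij) / (s * r_ij)."""
    i, j, r_ij = _pair_distances(coords)
    w = np.ones(len(coords)) if weights is None else np.asarray(weights, dtype=float)
    # np.sinc(x) = sin(pi x) / (pi x), so sinc(s r / pi) = sin(s r) / (s r), also at s = 0
    return np.sum(w[i] * w[j] * np.sinc(s * r_ij / np.pi))

# Hypothetical three-atom geometry in angstroms (not a real coumarin)
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 1.5, 0.0)]

# Unweighted RDF code at R = 3.0 angstroms: dominated by the pair 3.0 angstroms apart
print("RDF at 3.0 A (unweighted):", rdf_code(coords, radius=3.0))

# Scattering sum with hypothetical polarizability weights; s = 3.0 is an assumed value
polarizabilities = [1.0, 0.8, 0.4]
print("MoRSE-type sum at s = 3.0:", morse_code(coords, s=3.0, weights=polarizabilities))
```

In the unweighted case every atom pair contributes equally, so the RDF value at a given radius simply reflects how many interatomic distances fall close to that radius.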
The predictive power of the developed models was evaluated using internal and external validation measures. For this purpose, the squared correlation coefficient (r²), the leave-one-out cross-validated correlation coefficient (q²(LOO)), the squared adjusted correlation coefficient (r²(adjusted)), the standard error of estimate (SEE), and the SDEP were calculated for the train set, and Q²(test) was computed for the test set (Table 3). The squared correlation coefficient is fitted on the whole train set, and QSAR models with r² > 0.6 are considered reliable. 57 As seen in Table 3, r² values of 0.689 and 0.749 were obtained for equations 6 and 7, respectively. q²(LOO) and r² - q²(LOO) are other criteria for evaluating the performance of QSAR models; q²(LOO) should be higher than 0.5 and r² - q²(LOO) should be lower than 0.3. [58][59][60] The calculated values of these parameters are 0.483 and 0.206 for Eq. 6 and 0.530 and 0.210 for Eq. 7. The generated QSAR models are the result of the GA-MLR methodology based on a uni-objective (i.e., F) optimization function. Two further metrics, r²m and Δr²m, were determined to assess the predictive ability of the QSAR models. The r²m metric, introduced by Roy and Roy, measures the proximity between the observed and predicted activities for the data set. 32 It has been suggested that, for models with reliable predictive power, the values of r²m and Δr²m should be greater than 0.5 and less than 0.2, respectively. Q²(test) is calculated for the test (unseen) set, and a value greater than 0.5 indicates the validity of the model. In this study, the Q²(test) values for the two developed models with three and four parameters are 0.670 and 0.691, respectively. These results demonstrate that both models have good predictive power and are reliable for predicting the antiproliferative activities of coumarin analogs. Furthermore, Eq. 7 has significantly higher predictive ability than Eq. 6, with a P-value close to zero.
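The internal and external validation parameters discussed above can be computed for any MLR model with a short Python sketch such as the one below. It is an illustration on synthetic data and does not reproduce the GA-MLR tool or the values in Table 3. The conventional definitions of the statistics are assumed, the mean of the train set activities is assumed as the reference mean in Q²(test), and all function names, descriptor values, and activities are hypothetical.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ..., bp]."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def validation_report(X_train, y_train, X_test, y_test):
    n, p = X_train.shape
    coef = fit_mlr(X_train, y_train)
    ss_tot = np.sum((y_train - y_train.mean()) ** 2)
    ss_res = np.sum((y_train - predict(coef, X_train)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    see = np.sqrt(ss_res / (n - p - 1))

    # Leave-one-out: refit without compound i, then predict compound i
    loo_pred = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        coef_i = fit_mlr(X_train[mask], y_train[mask])
        loo_pred[i] = predict(coef_i, X_train[i:i + 1])[0]
    q2_loo = 1.0 - np.sum((y_train - loo_pred) ** 2) / ss_tot

    # External validation on the test set
    y_hat = predict(coef, X_test)
    press = np.sum((y_test - y_hat) ** 2)
    # Assumption: the train set mean is used as the reference mean
    q2_test = 1.0 - press / np.sum((y_test - y_train.mean()) ** 2)
    ape = 100.0 * np.abs(y_test - y_hat) / np.abs(y_test)
    return {
        "r2": r2, "r2_adj": r2_adj, "see": see, "q2_loo": q2_loo,
        "q2_test": q2_test, "mape": ape.mean(),
        "sdep": np.sqrt(press / len(y_test)),
    }

# Synthetic data: 31 hypothetical compounds, 3 descriptors, 21/10 split
rng = np.random.default_rng(1)
X = rng.normal(size=(31, 3))
y = 7.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.3, size=31)
report = validation_report(X[:21], y[:21], X[21:], y[21:])

for name, value in report.items():
    print(f"{name:8s} {value:.3f}")
print("r2 - q2_loo < 0.3:", report["r2"] - report["q2_loo"] < 0.3)
```

For an ordinary least squares model, q²(LOO) can never exceed r², so the difference r² - q²(LOO) gives a simple indication of how much the fit depends on individual compounds.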
Figure 1 shows the correlation between the experimental and predicted pIC50 values according to equations 6 and 7 for the studied coumarin compounds (total data). The resulting correlation coefficients of 0.688 and 0.717 between the observed and calculated activities using Eq. 6 and Eq. 7, respectively, demonstrate the reliability of the proposed models for predictive purposes.

Conclusion
In the present study, QSAR analysis was performed using the GA-MLR method to construct models for predicting the antiproliferative activities of coumarin derivatives as potential anticancer compounds. Internal and external validation methods were used to investigate the predictive performance of the two developed MLR models. The calculated validation parameters showed that both models could predict the biological activities of coumarins well. Based on the obtained results, the predictive power and performance of the model with four descriptors (Eq. 7) are higher than those of the model with three descriptors (Eq. 6), owing to the inclusion of one more significant variable (Mor04p). Our findings could be helpful in estimating the activity of, as well as in designing, synthesizing, and developing, novel anticancer drugs based on the coumarin scaffold.

Ethical Issues
Not applicable.

Conflict of Interest
Authors declare no conflict of interest in this study.