Prediction of critical micelle concentration for per- and polyfluoroalkyl substances

ABSTRACT In this study, we focus on the development of Quantitative Structure-Property Relationship (QSPR) models to predict the critical micelle concentration (CMC) for per- and polyfluoroalkyl substances (PFASs). Experimental CMC values for both fluorinated and non-fluorinated compounds were meticulously compiled from existing literature sources. Our approach involved constructing two distinct types of models based on Support Vector Machine (SVM) algorithms applied to the dataset. Type (I) models were trained exclusively on CMC values for fluorinated compounds, while Type (II) models were developed utilizing the entire dataset, incorporating both fluorinated and non-fluorinated compounds. Comparative analyses were conducted against reference data, as well as between the two model types. Encouragingly, both types of models exhibited robust predictive capabilities and demonstrated high reliability. Subsequently, the model having the broadest applicability domain was selected to complement the existing experimental data, thereby enhancing our understanding of PFAS behaviour.


Introduction
Per-and polyfluoroalkyl substances (PFASs) are purely anthropogenic chemicals that have been widely synthetized since the late 1940s [1,2].This family of compounds contains a large variety of chemical functional groups such as alcohols, ethers, esters, surfactants with a fully or partially fluorinated carbon chain.The carbon-fluorine bonds being one of the strongest in organic chemistry gives most of PFASs high chemical and thermal stabilities.This provides PFASs a wide range of applications among which waterproofing of textiles, oil and grease repellent for packaging, non-stick coating properties for household products, or even their use in fire-fighting foams [2][3][4].However, the exceptional properties of PFASs also have their drawbacks, namely as they are resistant to degradation and persistent in the environment, making them labelled as 'eternal pollutants' or 'forever chemicals'.For instance, PFAS surfactants when released into the wild intentionally or unintentionally depending on the application, can accumulate in waste, surface and/or ground waters as well as in the soil according to their thermophysical properties [3,5].The livings can therefore be exposed to PFAS [6], and this has become a global concern due to the potential toxicity of these latter [7].PFASs pose a significant concern as endocrine disruptors due to their potential to interfere with hormonal systems [8,9].These substances have been linked to disruptions in hormone production, regulation, and function, contributing to adverse health effects in both humans and wildlife.Understanding the link between the structure of PFASs and their role in the environment is crucial for safeguarding public health and environmental well-being [10].Technologies for removing PFASs from the environment and/or degrading them have been proposed but still need an important research effort [11][12][13].
Although the PFAS family contains thousands of compounds, only a marginal number has been experimentally characterized to date, and data available in the literature are still scarce [14].It is clearly of the utmost importance to develop robust predictive models to overcome the lack of experimental information about PFASs.Sosnowska et al. recently reviewed Quantitative Structure-Property Relationship (QSPR) based models developed to predict properties of per-and polyfluoroalkyl substances, including vapour pressure, water solubility, critical micelle concentration (CMC), or even partition coefficients [15].As many PFASs are strongly surface-active molecules, the knowledge of their behaviour in aqueous solution as well as at interfaces is a key parameter for numerous processes [16].When the concentration of surfactant molecules increases, they tend to aggregate and form micelles.The concentration at which micelles occur defines the CMC.It has been hypothesized that the ability of PFAS surfactants to form micelles may play a role on their adsorption or solubility-related properties [17][18][19].
Only one CMC prediction model has been reported in the review proposed by Sosnowska et al. [15], originally developed by Bhhatarai and Gramatica [20], it was derived from ten experimental values, mainly CMC values for perfluoroalkanoic acids with a number of carbon atoms ranging from 4 to 10. Le et al. developed a group contribution (GC) model to predict surfactant surface excess at low concentration (K iwÀ 0 ), and then considered a proportionality rule between the ratio of K iwÀ 0 values and the ratio of CMC values, for two surfactants (one of which being unknown) [21].Le et al. assessed their predictive GC method on 25 PFASs, and data provided by the authors lead to a root mean squared error (RMSE) of 52.3 mM.
The prediction of critical micelle concentration for non-fluorinated surfactants using machine learning approaches has been the subject of numerous works [22].More recently, Qin et al. [23] and then Moriarty et al. [24] trained neural networks to predict the CMC using a similar dataset containing non-ionic, cationic, anionic, and zwitterionic surfactants, though it included a minor fraction of fluorinated compounds.To the authors' knowledge, the used collection of 202 surfactants curated by Qin et al. is currently the largest public dataset of CMC values for surfactants.Authors used molecular graphs to encode surfactant structures, resulting in models with good performances.Noteworthy, these studies identified features of the hydrocarbon tails as the primary molecular fragment contributors to model predictions [24].
In this work, we address the development of QSPR-based model for predicting the critical micelle concentration of per-and polyfluoroalkyl substances.In the following section, the data collection and curation methods are presented as well as the methodology followed to develop new QSPR-based models.The predictive performances of models are then discussed as well as their use for external data prediction.

Materials and methods
In recent years, our group has devoted considerable effort in applying machine learning techniques to chemical databases in order to derive QSPR-based models for predicting various property values [25][26][27].The basis of such approaches stands in the identification of non-obvious correlations between property values and some descriptors encoding information about compounds.Reviews containing all the elements regarding the development and application of QSPR-based models, and best practices in developing such models, are available in the literature [28,29].

Database
The quality of data used as reference during the development of QSPR-based models is of the utmost importance for models' robustness and is often the keystone of such a modelling work success.As previously highlighted, although the PFAS family contains thousands of compounds, data available in the literature remains scarce.A compendium of CMC data for PFASs available in the literature was achieved [20,21,23,24,[30][31][32][33][34][35][36][37].It should be emphasized that no data originating from predictive modelling or even extrapolation from data fittings has been considered.Per-and polyfluoroalkyl substances were of main interest in this work.We compiled a dataset of 60 CMC values measured at 25°C, which, to the best of our knowledge, represents the most extensive collection of such data for PFASs.The final CMC values resulted from the average of the available data points unless significant discrepancies were observed between sources.In this latter case, a consistency analysis was performed against other CMC values in the databaseconsidering duplicates or trends within a family of molecules.This analysis allowed for the identification and removal of questionable values.For instance, Le et al. reported for PFHxS-K (perfluorohexanesulfonic acid, potassium salt) a CMC of 57 mM while references [33] and [32] indicate values of 8 mM and 12 mM, respectively.Among the 60 data points collected, nine correspond to non-ionic surfactants with partially fluorinated and hydrogenated skeletons containing chemical functions such as alcohol, ether or thioether.The remaining PFASs are ionic surfactants with tails totally fluorinated or partially hydrogenated and some of them include ether functions.PFASs surfactants heads' chemical functional groups are in decreasing order of occurrence in the database: carboxylic acid, sulphonate, and trimethylammonium functions.Additionally, various aqueous solutions and salts were under study leading to the presence of different counterions' types in our data collection, i.e.H + , Li + , Na + , Mg 2+ , NH 4 + , K + , and Br − .
Considering the diversity of chemical functional groups within our compendium and the still quite limited total number of PFASs structures, the CMC data collected, merged, and curated by Qin et al. [23] (database then reused by Moriarty et al. [24]) was employed to supplement the content of our data collection.This dataset contains 199 CMC values for different classes of surfactants in aqueous solutions for temperatures in between 20°C and 25°C.As detailed hereafter, Qin et al. dataset will be used during model training stage to reinforce the chemical information contained in our PFASs database.

Molecular descriptors
From conclusions drawn in our previous studies [25,38,39], descriptors calculated on the basis of the chemical and structural formulae were under study and hereafter are labelled as functional group count descriptors (FGCD).In this latter family of molecular descriptors are included counts of atoms and groups of atoms identified as relevant from chemical aspects.It was demonstrated that such a simple representation of compounds provides relevant descriptors usable for the development of QSPR [40].FGCD under consideration in this study, labelled from X1 to X63, are listed in Table 1.As an example, the FGCD labelled X12 denotes the number of CF 3 groups, X15 the number of CF 2 groups.Some descriptors implicitly take into account branching effects, X18 counts the number of quaternary carbon atoms, and X16 and X17 count the numbers of carbon atoms baring one branching for hydrocarbons and fluorinated compounds, respectively.The descriptor X1 (MM) denotes the molar mass of neat compounds.Also, we ensured that sequences of atoms having the most importance in the generic model developed by Moriarty et al. [24] are represented in our list of descriptors, e.g.X14, X21, X42 to X47, X56, and X57.Simplified molecular input-line entry specification (SMILES) codes were assigned to each neat compound within the database.FGCD were automatically generated using RDKit software [41] and SMILES arbitrary target specification (SMARTS) matching functionalities [42], SMARTS codes corresponding to considered FGCD are given in Table 1.
Table 1.Molecular descriptors selected for the development of models.
The complete database containing SMILES codes and the corresponding CMC experimental values used in this work is available in Table S1 in the supporting information.

Chemical space representation
The chemical information contained implicitly within our database was pre-processed by applying a principal component analysis (PCA) on X1 to X63 descriptor values.The first two principal components resulting from the PCA were used as an approximated graphical representation of the chemical space of our database, and Figure 1   Two approaches were investigated to derive models: in the first one, only the PFASs are considered, while in the second, fluorinated and non-fluorinated compounds are taken into account.Given the focus on a model for PFASs CMC, all non-fluorinated compounds are systematically used during the learning processes of the second model type, as discussed hereafter.

Machine learning modelling
During last decades of QSPR model developments, the use of external validation has been shown necessary to ensure their application to new fluids, i.e. not considered within the dataset used to train the model [43].Its popular version is the n-fold cross-validation (n-CV) in which the dataset is randomly divided in approximately equal n portions.A concatenation of (n-1) portions forms a training set -later used to train models -and the remaining portion constitutes an external set labelled as test set -later used to evaluate model performances.We emphasize that no data point belonging to external sets was used to derive models.This procedure is repeated n times choosing for each a new portion of data as an external set, leading to n different models.To avoid any strong violation of the applicability domain of models during the cross-validation procedure, we fixed five of the PFAS molecules in a specific fold which will always be used to form training sets.An additional external set labelled as validation set was built by randomly selecting PFAS molecules, and we emphasize that this set was only used to evaluate performances of the final model.Due to the limited number of PFAS in our database, the validation set contains five molecules.In this work, a 5-CV was applied to the 50 remaining PFASs molecules (details about fold assignment for each compound are provided in Table S1 in the supporting information), and for instance, when considering only fluorinated compounds the training, test and validation sets represent 75%, 17% and 8% of the database, respectively.The same fold assignation is used for the two types of models, i.e. derived on the basis of fluorinated molecules only, and fluorinated and non-fluorinated compounds.
We demonstrated in a number of previous works that the combination of molecular descriptors such as FGCD and Support Vector Machines (SVM) provides accurate solutions in terms of property modelling [25][26][27]39,40].The applied methodology described hereafter is very similar to that followed in our previous works [25].The LibSVM library [44] was employed to derive Support Vector Regression (SVR) with both linear and radial basis function kernels, and with an epsilon insensitive zone.Therefore, values for three SVR hyperparameters need to be optimized: cost, epsilon, and gamma.The approach previously proposed by Gantzer et al. was used to optimize the SVR hyperparameter values [45], integrated within a 5-CV procedure.Finally, a model was developed using the set of optimized hyperparameters and considering either the 55 PFASs solely or together with non-fluorinated compounds.
All generated models are evaluated according to their ability to predict experimental CMC values for PFASs only.Predicted values are compared to reference experimental data, and performances of models are evaluated by means of metrics such as Mean Absolute Error (MAE, equation ( 1)), Root Mean Squared Error (RMSE, equation ( 2)), or the coefficient of determination (r 2 , equation ( 3)), defined respectively as: with y i the predicted value, x i the experimental value, � x the average of experimental CMC values, and N is the number of data points in the considered set.MAE and RMSE values have the unit of the property under consideration, r 2 is unitless.

Results and discussion
It is now commonly assumed that within a family of surfactants (specific head group and carbon tail) and for a defined temperature condition, the CMC decreases exponentially when the number of carbon atoms (n c ) increases in the surfactant's hydrocarbon tail.This has given rise to the simple fitting expression as follows: where A and B are two empirical parameters, and B < 0 [46][47][48].Other similar expressions with additional terms were developed to account for the impact of groups such as oxyethylene group [46,47].Based on this information, each collected CMC value (expressed in mM) was transformed calculating its logarithm.Log CMC will be used hereafter as the endpoint during the modelling stages.Employing a logarithmic CMC aligns with a physics-aware learning approach.This strategy helps the training process in capturing a fundamental experimentally observed behaviour and prevents the prediction of non-physical negative values.
As previously mentioned, two types of models were developed, and SVR models of type (I) were derived only considering the 55 fluorinated compounds.When considering solely PFAS structures, descriptors X8-X10, X22, X32, X35, X36, X41, X48, X54, X57, and X61 returned zero value, they were consequently inactivated during the modelling procedure.A 5-CV was applied resulting in a splitting of the database into five folds (each containing 10 compounds), plus one additional -Fold-00 -containing five molecules fixed in the training set to avoid violation of the applicability domain.The cost, epsilon, and gamma hyperparameters for the SVR were optimized considering the six folds.Table 2 presents RMSE and r 2 values calculated on the training and test sets for the five ephemeral models generated during the 5-CV.It shows that values of metrics are roughly stable from one decomposition to another with for instance, mean RMSE and r 2 values of 0.097 (log mM) and 0.988 on training sets, respectively.Regarding test sets, mean RMSE and r 2 values are 0.252 (log mM) and 0.913, respectively.SVR models of type (II) were developed considering the 55 fluorinated compounds plus the 199 surfactants extracted from the works of Moriarty et al. [24].A 5-CV was applied, using the same folds as those defined for models' type (I) which enable fair comparisons.
Thus, Fold-01 to Fold-05 contain each 10 compounds, and Fold-00 contains 5 PFASs plus 199 non-fluorinated surfactants.Table 3 presents RMSE and r 2 values calculated on the training and test sets for the five ephemeral models generated during the 5-CV.It shows that values of metrics are roughly stable from one decomposition to another with for instance, mean RMSE and r 2 values of 0.095 (log mM) and 0.988 on training sets, respectively.Regarding test sets, mean values for RMSE and r 2 are 0.273 (log mM) and 0.899, respectively.Tables 2 and 3 show that the lowest R 2 values are obtained for Fold-04.This portion of the PFAS dataset contains the sodium tridecafluorohexanoate (TDHP) for which an experimental CMC of 26.4 mM was reported [21].When TDHP is in external set (Fold-04), types (I) and (II) models strongly overestimate the experimental value with predicted CMC of 71 mM and 75 mM, respectively.This contributes to strongly degrade the performance of models on this external set.
Figure 2 presents the parity plot of experimental vs. predicted CMC values using SVR ephemeral models of types (I) (empty red circles) and (II) (full grey circles), for PFASs in test sets during the 5-CV.The diagram clearly illustrates the similar performances for models of types (I) and (II).Furthermore, it reveals that both types of models lead to very similar predicted values for each compound, even when these deviate from the experimental CMC reference values.This suggests that although the applicability domain of type (II) models is much wider than that of type (I) models, the latter (specific to PFASs) exhibited comparable predictive performances.
From these observations, a final SVR type (II) model was then trained using the set of hyperparameters optimized (cost = 3301.24;epsilon = 0.929632; gamma = 0.23709) during the previously described 5-CV procedure and considering the whole dataset -the 254 fluorinated and non-fluorinated compounds -in the training set.Metrics for the soobtained SVR-based model are r 2 = 0.966, RMSE = 17.1 mM and MAE = 3.7 mM. Figure 3 presents the parity plot of experimental vs. predicted CMC values using the SVR model over the entire database.All data points are not too scattered on both sides of the bisector, indicating that the predicted values are in good agreement with reference data.However, some data points appear to have been overestimated (about one order of magnitude) by the model, they mainly correspond to non-fluorinated compounds.In the upper right-hand part of Figure 3, the use of the logarithmic scale leads to misleading observations, as large absolute deviations are squashed.This area mainly refers to compounds with a small number of carbon atoms, the largest absolute deviation is found for the perfluorobutanoic acid (PFBA) for which the model predicts a CMC of 549 mM whereas the reference experimental value is 759 mM.It should be noted that the model developed by Bhhatarai and Gramatica in 2011 although trained on ten reference CMC values, also strongly underestimated (631 mM) the experimental value [20].For fluorinated compounds on Figure 3, the point graphically appearing the farthest from the bisector stands for the sodium perfluorodecanoate (PFDA-Na) for which the prediction is 2.06 mM whereas an experimental value of 0.91 mM was found in the literature.However, if we consider the sodium perfluoroalkanoic acid (PFAA) series (where only n c varies in the linear fluorinated carbon tail), this latter value (n c = 10) does not correspond to the expected trends for the homologous family -full red triangles and red dashed line in Figure 4. On the contrary, the predicted value is in line with that trend.Figure 3 also presents a comparison between predicted and reference values for the five PFAS constituting the validation set.We emphasize that none of these molecules was considered Figure 4 presents some tendencies of CMC for some PFAA with different counterions such as Li + , Na + , NH 4 + , and K + , when increasing n c in the linear carbon chain.Values predicted using the SVR models of type (II) are compared to available experimental data, but do not take into account the water solubility of the compounds, which could be estimated using modelling approaches [49].The behaviour suggested by the empirical Equation ( 4) is satisfied in every case with more or less pronounced deviations from the expected linear trend, especially at the lower n c values (below six carbon atoms), for which we can assume that the head plays a more important role.Changing the counterion has an impact on the micellization behaviour, and predictions agree with Lunkenheimer et al. who studied the role of alkali ions on the CMC of perfluorooctanoic acid (Li + > Na + > K + ) [50,51].Table 4 presents values for the two parameters in Equation ( 4) when regressed on values predicted using the final SVR-based model of type (II) for PFAA and counterion systems.There is no obvious trend between parameter values and ion characteristics.
Rosen and Murphy suggested a relation between the parameter A and the hydrophilic group of the surfactant, and between the parameter B and the standard-free energy of micellization in water [52].Figure 5 presents some tendencies of CMC for some perfluoroalkane sulfonic acid (PFASA) with different counterions such as Li + , Na + , NH 4 + , and K + , when increasing n c in the linear carbon chain.Values predicted using the SVR models of type (II) are compared to available experimental data, but as for Figure 4, they do not account for the water solubility of compounds.SVR predictions exhibited trends aligning with the empirical Equation ( 4) with some deviations from the expected linear trend depending on the PFASA and counterion system, especially at the lowest n c values (below six carbon atoms).Table 4 presents values for the two parameters in Equation ( 4) when regressed on values predicted using the final SVR-based model of type (II) for PFAA plus counterion systems.There is no obvious trend between parameter values and ion characteristics.In Figure 5, for the system PFASA-Li, we  initially found in the literature two similar experimental CMC values (6.3 and 6.4 mM) but for two different n c of 6 and 8 (full grey squares).The SVR model seems to discriminate between these two values and to agree with the value for n c = 8.Furthermore, the SVR model underestimates the reference experimental CMC value (8 mM) for the system PFASA-Na with n c = 7.As we can see in Figure 5, the counterions not only impact the relative CMC values for the same carbon lengths of PFSAs, but also change the slope of the logarithmic trend.These predictions reflect the ability of our SVR models to capture the complex effects of counterions over the CMCs of ionic perfluoro-surfactants.

Conclusions
Per-and polyfluoroalkyls are substances for which there is currently a great interest in terms of ecotoxicological, toxicological, and physicochemical knowledge.Experimental property values related to the behaviour of PFASs in aqueous solutions lack in the literature, and this is typically the case for the critical micelle concentration.In this work, we performed a literature survey to extract experimental CMC values for fluorinated and non-fluorinated compounds, and then constituted a database which is available in Table S1 in the supporting information.We have proposed the application of chemoinformatics methodologies to this database in order to develop models capable of predicting the CMC of per-and polyfluoroalkyl substances.
Two types of models were derived from the application of the SVM algorithm to the database, within a 5-CV procedure.Only data points related to fluorinated compounds are used to develop type (I) models, while the entire contents of the database are taken into account to generate type (II) models.We emphasize that in both cases, exactly the same folds are used during the cross-validation, except Fold-00 -always employed in the training set -which for type (I) contains 5 PFASs and for type (II) 5 PFAS plus all the nonfluorinated compounds.Comparisons performed with respect to reference data revealed the high predictive reliability and robustness of our SVR models.This confirmed that our SVR model is suitable for predicting CMC values for compounds lacking extensive experimental information in the literature.
The SVR model of type (II) was used to generate predictions for compounds within homologous series of perfluoroalkanoic and perfluoroalkane sulfonic acids.It provided an interesting complement to the experimental information available -by defining the parameters of the empirical linear equation -and also highlighted questionable data in our database.
represents projections of compounds within this bidimensional chemical space.This representation shows that our descriptor set enables us to distinguish between fluorinated (open circles) and non-fluorinated (triangles) compounds in the database.Most PFASs are located in a small area of the diagram.A separated cluster composed of four PFASs can be observed in the upper left part of the diagram, these are polyfluoroalkyl substances involving carboncarbon double bonds and ether groups.Some other PFASs are on the border between fluorinated and non-fluorinated compounds, these involve a thioether or trimethylammonium function.

Figure 1 .
Figure 1.Projections of compounds into the approximated chemical space formed by PC1 and PC2, the two first principal components resulting from the PCA performed on descriptor values.The 60 perand polyfluoroalkyl substances are represented with empty black circles, and the 199 surfactants with purple triangles.The percentages of variance explained by PC1 and PC2 are 17% and 12%, respectively.

Figure 2 .
Figure 2. Scatterplots of experimental vs. predicted CMC values (mM) using ephemeral SVR models of type (I) (red empty circles) and type (II) (grey circles), for PFASs in test sets during the 5-CV.The dashed line stands for the bisector of the diagram.A logarithmic scale is used due to the number of CMC orders of magnitude covered in the dataset.

Figure 3 .
Figure 3. Scatterplots of experimental vs. predicted CMC values (mM) using the final SVR models of type (II).The 55 per-and polyfluoroalkyl substances are represented with empty black circles, and the 199 surfactants with purple triangles.PFAS molecules belonging to the validation set are in orange circles.The dashed line stands for the bisector of the diagram.A logarithmic scale is used due to the number of CMC orders of magnitude covered in the dataset.

Figure 4 .
Figure 4. Tendencies of the CMC evolution for some perfluoroalkanoic acids (PFAA) when increasing n c , and for various counterions (Li + , Na + , NH 4+ , and K + ).The parameter n c denotes the number of carbon atoms.Lines represent the data predicted using the SVR models of type (II), and symbols denotes experimental data.

Figure 5 .
Figure5.Tendencies of the CMC evolution for some perfluoroalkanesulfonic acids (PFASA) when increasing n c , and for various counterions (Li + , Na + , NH 4 + , and K + ).The parameter n c denotes the number of carbon atoms.Lines represent the data predicted using the SVR models of type (II), and symbols denotes experimental data.

Table 2 .
RMSE (log mM) and r 2 values calculated on training and test sets for the ephemeral SVR-based models' type (I) generated during the 5-CV applied on the 55 PFASs.

Table 3 .
[24] (log mM) and r 2 values calculated on training and test sets for the ephemeral SVR-based models' type (II) generated during the 5-CV applied on the database containing the 55 PFASs and 199 surfactants used by Moriarty et al.[24].

Table 4 .
Values of the parameters in equation (4) regressed on predictions using the final SVR-based model of type (II).