Elucidating ‘undecipherable’ chemical structures using computer‐assisted structure elucidation approaches

Structure elucidation using 2D NMR data and application of traditional methods of structure elucidation are known to fail for certain problems. In this work, it is shown that computer‐assisted structure elucidation methods are capable of solving such problems. We conclude that it is now impossible to evaluate the capabilities of novel NMR experimental techniques in isolation from expert systems developed for processing fuzzy, incomplete and contradictory information obtained from 2D NMR spectra. Copyright © 2012 John Wiley & Sons, Ltd.


Introduction
During the last two decades, we have witnessed dramatic developments in new approaches to elucidate molecular structures. 1D and especially 2D NMR spectroscopy are the main sources of structural information utilized to extract chemical structures from analytical data. Even when reference 1D 13 C spectra are unavailable, the chemical shifts can be extracted from the 2D spectra (HSQC and HMBC spectra). However, it is frequently difficult to unambiguously elucidate a single molecular structure from the experimental spectra. Detailed analysis of 2D NMR data commonly shows that the data are fuzzy in nature. [1] For example, there may be severe overlap in the 2D cross peaks; 2D NMR correlations in long-range heteronuclear experiments can vary from two to four bonds in length (such non-standard correlations were detected in ~45% of the 2D spectra of 250 natural products analyzed [2] ); data may be incomplete (not all expected 2D NMR correlations are observable); and, rather frequently, contradictory strong and medium 2D NMR peaks related to correlations of the n J HH,CH type (n > 3) can be present. The set of 2D NMR experiments continues to grow with new techniques, for instance H2BC, 1,1-ADEQUATE, 1,n-ADEQUATE and hyphenated approaches such as HSQC-TOCSY (see reviews [3,4] ). The application of these techniques can help to remove the uncertainties in the topological distances between some of the correlating nuclei and directly influences the ability to elucidate the structure of an unknown compound.
In parallel with the development of new 2D NMR techniques, new approaches for computer-assisted structure elucidation (CASE) have been developed. Starting from pioneering works [5][6][7][8] published in the late 1960s, researchers created several generations of CASE expert systems. [9] Contemporary expert systems are powerful analytical tools capable of assisting in the elucidation of very complex molecular structures. These systems adequately mimic the systematic reasoning of spectroscopists but significantly outperform the human expert in logical-combinatorial reasoning. All assumptions used during the structure elucidation process can be considered as a set of 'axioms'. [1] An expert system deduces all logical corollaries (structures) that follow from this set of 'axioms' without exclusion. The number of corollaries (structures) is finite because the number of isomers corresponding to a molecular formula is limited. The selection of the most probable structure from those enumerated can be performed by using NMR chemical shift prediction methods. [10,11] Experience accumulated in the application of expert systems [9] shows that CASE methods can dramatically accelerate the procedure of structure elucidation, provide improved reliability of the structure elucidation and, as a result, save researchers significant amounts of time. Ultimately, expert systems should be considered as synergistic tools to support spectroscopists. They are unlikely to ever fully replace experts as situations will certainly occur where existing algorithms will be unable to outperform human intellect and expertise. The capabilities and limitations have been reviewed in detail elsewhere. [9] It should be noted that the processing of 2D NMR spectra, and specifically the extraction of the 1 H and X nucleus peak shifts, especially 13 C, can be a very time-consuming process. 
In many cases, the extraction of the shifts from the data can be the main component of the total time for solving a problem using an expert system.
In this article, we will try to show that modern CASE expert systems should be considered as an integral part of a spectroscopist's armory for the quick and reliable elucidation of molecular structures. Two examples of structure elucidation from 1D and 2D NMR data will be considered where experienced spectroscopists were confronted with insuperable difficulties. [12,13] The solution of these problems using the CASE approach allows us to assert that it is now impossible to evaluate the capabilities of NMR experimental techniques in isolation from expert systems developed for processing fuzzy, incomplete and contradictory information obtained from 2D NMR spectra.

Results and Discussion
Example 1. Structure elucidation from extremely contradictory 2D NMR data
Kummerlöwe and co-workers [12] investigated one of the products obtained by reacting an azide-containing 1,5-enyne in the presence of electrophilic iodine sources. Initially, the researchers tried to elucidate the structure of this new compound by using classical methods commonly employed in such cases. High resolution mass spectrometry unambiguously provided the molecular formula for the unknown: C 16 H 18 NI, m/z = 351.0486 [351.0484 calculated for C 16 H 18 NI (M + )]. The following spectroscopic data were acquired at the first stage of the investigation: IR spectrum, 1D 1 H and 13 C spectra in combination with two-dimensional COSY, HSQC, 1 H-13 C HMBC and 1 H-15 N HMBC experiments. Eleven fragments were identified from the data: a phenyl group, a methyl group, five methylene groups (three forming an isolated chain), a tertiary nitrogen atom, an iodine atom and four quaternary carbon atoms. The 1 H-13 C HMBC spectrum revealed 63 long-range correlations and the 1 H-15 N HMBC spectrum exposed seven cross peaks, thereby correlating almost every fragment with every other fragment and indicating a very compact structure. Because it was difficult to deduce the structure from these data, a 2D 1,1-ADEQUATE spectrum [14] was also recorded on a Bruker Avance 900 MHz spectrometer (Bruker Biospin, Rheinstetten, Germany) equipped with a 5-mm cryogenically cooled TXI probehead optimized for proton detection. The 1,1-ADEQUATE data did identify adjacent quaternary carbons unequivocally. Although this information was useful, these additional data did not help to elucidate the structure.
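The agreement between the measured and calculated m/z values quoted above can be checked with a short calculation. The following Python sketch is an illustration only and is not part of the original work: the function names are hypothetical, the isotope masses are standard reference values for the most abundant isotopes, and the small electron-mass correction for the cation (about 0.0005 u) is ignored, so the sum is that of the neutral atoms.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes (standard reference values).
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "I": 126.904473,
}


def parse_formula(formula):
    """Parse a simple formula such as 'C16H18NI' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum of neutral-atom monoisotopic masses (no electron-mass correction)."""
    counts = parse_formula(formula)
    return sum(MONOISOTOPIC_MASS[el] * n for el, n in counts.items())


def ppm_error(measured, calculated):
    """Relative deviation of the measured mass in parts per million."""
    return (measured - calculated) / calculated * 1e6


calculated = monoisotopic_mass("C16H18NI")   # approximately 351.0484
deviation = ppm_error(351.0486, calculated)  # below 1 ppm
print(f"{calculated:.4f} u, deviation {deviation:.2f} ppm")
```

Under these assumptions, the neutral-atom sum reproduces the calculated value quoted in the text, and the measured value deviates from it by less than 1 ppm.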
Because classical NMR analysis failed, the authors [12] decided to make an attempt to solve the problem in an unconventional way by using residual dipolar couplings (RDCs). [15] In accordance with the methodology associated with RDC, they assumed that as long as sufficient anisotropic parameters can be measured and a large enough set of structural models can be constructed, it should be possible to identify the correct chemical structure.
To measure the RDCs, the compound was aligned in a stretched polystyrene/chloroform gel. The corresponding scalar couplings were measured in a chloroform solution sample. Fourteen proposed structures, including several models that were unlikely (see Fig. 1), were tested using the experimental data. Analysis of the RDC data suggested that structure #2a is the correct one.
To confirm the structure suggested by the RDC data, almost 100 mg of the reaction product was synthesized, and a 2D INADEQUATE spectrum [14] was acquired using 3 days of spectrometer time. The structure #2a elucidated using the RDC data was unambiguously confirmed by the INADEQUATE data. In addition, labeling the starting material of the reaction with 15 N-azide and measuring 13 C-15 N couplings for the 15 N-labeled compound were performed. Both additional experiments clearly supported structure #2a.
Retrospective analysis of the data showed that the 1 H-13 C HMBC spectra contained nine so-called 'nonstandard' correlations (NSCs) (those having n J HC , n > 3). [16] This is not surprising considering that the molecule is a highly rigid system. From the combinations of the available 1 H-1 H and 1 H-13 C correlations, the CASE program derives carbon-carbon connectivities; in this case, nine of them were nonstandard C to C connectivities (see Fig. 2), which was the main obstacle to structure elucidation using a traditional approach. The initial system of 'axioms' used for the structure elucidation from HMBC data [1] became extremely contradictory because of the presence of nonstandard connectivities. Moreover, two unexpectedly intense 5 J CH cross peaks correlating two protons with the ortho-carbons of the phenyl group (see Fig. 2) were identified in the 1 H-13 C HMBC spectrum. The corresponding part of the HMBC spectrum is presented in Fig. 3, taken from the supporting information of Kummerlöwe et al. [12] We suggest that this can be explained as a result of the hindered rotation of the phenyl group due to the large volume of the iodine atom.
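The distinction between standard and nonstandard correlations can be made concrete by counting bonds in a candidate structure. The sketch below is a simplified illustration, not the algorithm implemented in any CASE program: the adjacency list describes a hypothetical five-carbon chain rather than the compound discussed here, and the function names are invented for this example. A correlation from a proton on one carbon to another carbon is classified by the number of bonds n in the coupling path, where n equals the heavy-atom distance plus one for the H-C bond.

```python
from collections import deque


def bond_distance(adjacency, start, goal):
    """Shortest number of bonds between two heavy atoms (breadth-first search)."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        atom, dist = queue.popleft()
        if atom == goal:
            return dist
        for neighbour in adjacency[atom]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # the two atoms are not connected


def classify_hmbc(adjacency, correlations):
    """Map each (proton-bearing carbon, correlated carbon) pair to (n, label)."""
    result = {}
    for h_carbon, carbon in correlations:
        heavy = bond_distance(adjacency, h_carbon, carbon)
        if heavy is None:
            result[(h_carbon, carbon)] = (None, "unconnected")
            continue
        n = heavy + 1  # the H-C bond adds one bond to the coupling path
        if n in (2, 3):
            label = "standard"
        elif n > 3:
            label = "nonstandard"
        else:
            label = "one-bond"
        result[(h_carbon, carbon)] = (n, label)
    return result


# Hypothetical skeleton: a linear chain C1-C2-C3-C4-C5.
chain = {
    "C1": ["C2"],
    "C2": ["C1", "C3"],
    "C3": ["C2", "C4"],
    "C4": ["C3", "C5"],
    "C5": ["C4"],
}
labels = classify_hmbc(chain, [("C1", "C3"), ("C1", "C4"), ("C1", "C5")])
for pair, (n, label) in labels.items():
    print(pair, n, label)
```

A structure generator that accepts only standard correlations would discard any candidate in which a pair is labeled nonstandard, which is why undetected NSCs make the initial set of 'axioms' contradictory.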
At the same time, the authors [12] found that structure #2a was almost certainly excluded from the potential set of structures because the 13 C chemical shifts predicted by ChemDraw (CambridgeSoft Corp., Massachusetts, USA), [17] and presented in this work, differed significantly from the experimental data (see Fig. 4 where the results obtained by neural net algorithm [11] are shown for comparison). The average error was 4.65 ppm, with the linear regression described by R 2 = 0.982, which can indeed be taken as a hint that structure #2a is questionable.
The highly complex nature of the 2D NMR data prompted the authors to conclude that the problem could not be solved by a classical approach. In making this decision, they considered the NMR data only in isolation from algorithm-assisted approaches, for example those available in CASE software such as Structure Elucidator (ACD/Labs Inc., Ontario, Canada). [18] This software package has been applied for over a decade to solve real-world problems. The experimental data presented in the work [12] were therefore analyzed using the software program, with several modes of problem solving examined.
Run 1. The molecular formula, 1D 13 C, HSQC, 1 H-13 C HMBC and 1 H-15 N HMBC spectra were input into the program. All five HMBC peaks marked in [12] as very weak were ignored for the first run to reduce the possible number of NSCs. A Molecular Connectivity Diagram (MCD) [18] was automatically created as shown in Fig. 5.
Figure 4. The regression plot of the 13 C chemical shifts calculated using the neural net algorithm [11] (triangles) and using algorithms contained within ChemDraw (squares) versus the experimental values. The first value shown in the box is the experimental shift, and the second value is that calculated by the ChemDraw program.
Figure 5. A Molecular Connectivity Diagram. The atom hybridization assigned by the program is marked by the following colors (see online version): sp 3 , blue; sp 2 , violet; not sp, brown (94.6). The label 'fb' denotes the prohibition of being adjacent to neighboring heteroatoms. Note that the quaternary carbon C (94.6) is marked as not sp as its hybridization is allowed to be either sp 2 or sp 3 .
As a result of the logical analysis of the MCD, the program discovered the presence of NSCs in the HMBC spectrum, which suggested that an approach we have termed 'Fuzzy Structure Generation' (FSG) was necessary. [16,19] It should be noted that the FSG mode freely allows the long-range correlation lengths to be varied to any extent. FSG was run assuming that the HMBC data contain an unknown number of NSCs, each of them being of unknown length.
No assumptions or user interventions were used. As a result of structure generation accompanied by spectral and structural filtration, [18] three possible structures were output in 13 min. 13 C and 1 H chemical shift predictions using the neural net algorithms [10,11] incorporated into Structure Elucidator were performed, and the structural file was then ranked in ascending order of the 13 C chemical shift average deviation (Fig. 6). Although 1 H is generally a 'weaker' nucleus than 13 C for rank-ordering structures, in this case the 1 H prediction proved to be of value in providing additional confirmation of the structure. Figure 6 shows that the correct structure #2a was identified as the most probable structure, and its 13 C deviation is significantly (almost two times) smaller than that calculated with ChemDraw (see also Fig. 4). The chemical shift assignment for structure #1 (as shown in Fig. 6) suggested by the prediction algorithms fully coincided with that suggested by the authors. [12] The proposed structure #2b (#2 in Fig. 6) was also generated but was rejected on the basis of the chemical shift predictions. Structure #3 follows as a logical consequence of the experimental data but can be rejected because of its higher chemical shift deviations: both the 1 H and 13 C prediction deviations are almost twice the size of those for the first ranked structure, and our experience [9] shows that such large differences remove the structure from consideration.
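The ranking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure implemented in Structure Elucidator: the deviation is taken here as the mean absolute difference between predicted and experimental shifts, the predicted shifts are assumed to be already assigned to the same atoms as the experimental ones, and all candidate names and shift values are invented for the example.

```python
def mean_abs_deviation(predicted, experimental):
    """Average absolute difference (ppm) between assigned shift lists."""
    if len(predicted) != len(experimental):
        raise ValueError("shift lists must have the same length")
    total = sum(abs(p - e) for p, e in zip(predicted, experimental))
    return total / len(experimental)


def rank_candidates(candidates, experimental):
    """Return (deviation, name) pairs sorted in ascending order of deviation."""
    scored = [
        (mean_abs_deviation(shifts, experimental), name)
        for name, shifts in candidates.items()
    ]
    return sorted(scored)


# Hypothetical experimental and predicted 13C shifts (ppm).
experimental = [20.0, 35.0, 95.0, 128.0]
candidates = {
    "candidate A": [21.0, 34.0, 96.0, 127.0],
    "candidate B": [24.0, 31.0, 101.0, 125.0],
}
ranking = rank_candidates(candidates, experimental)
for deviation, name in ranking:
    print(f"{name}: {deviation:.2f} ppm")
```

In this toy data set, candidate A has the smaller deviation and is ranked first; a candidate whose deviation is roughly twice that of the first ranked one would be a natural choice for rejection, in line with the reasoning in the text.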
Run 2. All HMBC correlations, without any exclusion and including the set of nine NSCs, were used, and 1,1-ADEQUATE correlations were also added to the 2D NMR data (see MCD in Fig. 7).
Fuzzy Structure Generation was run with the following result: only one structure, the correct structure #1 (Fig. 6), was generated in 0.7 s. The application of the CASE approach therefore allowed us to find the single correct structure instantly and unambiguously from the HMBC and 1,1-ADEQUATE data. It has now been shown a number of times [20][21][22][23] that 1,1-ADEQUATE data in conjunction with other 2D NMR data form a very valuable input for CASE programs.

Example 2. Structure elucidation from incomplete 2D NMR data
Figure 7. A Molecular Connectivity Diagram. Atom hybridization assigned by the program is marked by the following colors (see online version): sp 3 , blue; sp 2 , violet; not sp, brown (94.6). The label 'fb' denotes the prohibition of neighboring with heteroatoms. 1,1-ADEQUATE connectivities are denoted by bold lines.
Figure 8. Possible structures of cephalandole A proposed by Gross and co-workers [13] .
The second example reviewed in this work was inspired by the article published by Gross and co-workers. [13] They suggested a new method of determining the structures of small planar molecules based on Atomic Force Microscopy (AFM). [24,25] (It should be noted that the authors used the terms AFM and SPM interchangeably in their article; we adhere to the use of AFM only in this article.) This approach would clearly make an excellent adjunct to the other tools available for organic structure analysis, and to validate its utility, they studied the natural product cephalandole A, (1), C 16 H 10 N 2 O 2 , which had previously been misassigned by Wu et al. [26] and later corrected by Mason et al. [27] The authors [13] explain that this compound was selected for testing the AFM method because it meets all three criteria specified previously that render structure analysis especially challenging: [28] The ratio of heavy atoms to protons is ca 2 : 1, and the O and N atoms at positions 1 and 4, respectively, interrupt the carbon skeleton completely, separating the two parts of the molecule. In addition, the carbonyl at C2 is distanced from the nearest proton by four bonds and is not expected to show correlations in an HMBC experiment. The molecular formula indicates that there are 13 degrees of unsaturation in the structure. 1 H-13 C HMBC and very sparse COSY data were used by the authors [13] to elucidate the structure. On the basis of NMR data analysis, the authors suggested four structures consistent with the available data (see Fig. 8).
The authors comment that the available NMR data did not allow distinction between a 2- or 3-substituted indole substructure, and therefore, all four structures #1-#4 could be considered plausible. Structure #1 is the accepted structure of cephalandole A, and structure #2 is the previously misassigned structure of this compound.
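The 13 degrees of unsaturation quoted for cephalandole A follow directly from the molecular formula. The short Python sketch below applies the standard formula (2C + 2 + N - H - X)/2, where X is the number of halogen atoms; the function name is hypothetical, and the parser handles only simple formulas without brackets or charges.

```python
import re


def degrees_of_unsaturation(formula):
    """Rings plus pi bonds from a simple molecular formula such as 'C16H10N2O2'."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    carbons = counts.get("C", 0)
    hydrogens = counts.get("H", 0)
    nitrogens = counts.get("N", 0)
    # Halogens count like hydrogens; oxygen does not affect the result.
    halogens = sum(counts.get(x, 0) for x in ("F", "Cl", "Br", "I"))
    return (2 * carbons + 2 + nitrogens - hydrogens - halogens) / 2


print(degrees_of_unsaturation("C16H10N2O2"))  # cephalandole A (Example 2)
print(degrees_of_unsaturation("C16H18NI"))    # the compound of Example 1
```

For C 16 H 10 N 2 O 2 this gives (32 + 2 + 2 - 10)/2 = 13, in agreement with the text; the same formula gives 8 for the compound of Example 1.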
Gross and co-workers have demonstrated that the AFM approach combined with quantum-chemical computations is capable of helping to select structure #1 as the most probable one through the analysis of molecular images, and this gives spectroscopists a new independent tool to distinguish planar molecules that may have similar structures.
The problem of elucidating the cephalandole A structure posed by Gross and co-workers was used by us as another challenge for CASE expert systems. The 1D and 2D NMR spectra acquired by the authors [13] to analyze this problem were input into Structure Elucidator, and the MCD was created. No user intervention or data corrections were made. Checking of the MCD detected the presence of NSCs in the 2D NMR data, and the FSG mode was therefore employed for the structure elucidation. As a result, the program produced an output file of 11 structures in 1 min 50 s, and structure #1 was selected as the best candidate using, as in the previous example, 13 C and 1 H NMR chemical shift prediction by the neural network algorithm. Structure #2 (Fig. 8), initially suggested by Wu and co-workers as the correct structure, was also generated and ranked ninth in the file with a deviation of d N ( 13 C) = 3.70 ppm; hence, it should definitely be rejected. In our previous review, [1] we showed that this structure would have been immediately rejected by the researchers on the basis of the 13 C chemical shift deviations calculated with the aid of Structure Elucidator. Figure 9 shows three (of 11) structures that are of similar shape and ranked as first, fourth and ninth.
One of the reviewers of this manuscript commented that relative to Gross et al.'s original statement that 'the available NMR data did not allow distinction between a 2- or 3-substituted indole substructure', a 3 J CC correlation in a 1,n-ADEQUATE spectrum could link the 2-position of the indole of structure #1 with the carbonyl, assigning the structure, if structures #1 and #2 had been selected based on 1,1-ADEQUATE data. Conversely, a 3 J CC correlation in a 1,n-ADEQUATE spectrum could link the 3-position of the indole of structure #3 to the carbonyl if structures #3 and #4 had been selected based on 1,1-ADEQUATE data. Hence, there are viable spectroscopic routes to the identification of the structure in Example 2 parallel to CASE methods. Another possible approach to the assignment of the structure is based on long-range 1 H-15 N 2D NMR. A 1 H-15 N HMBC spectrum optimized for ~3 Hz would be expected to give correlations to both 15 N resonances in the case of structures #1 or #2. It would be ideal to perform these additional 2D NMR experiments and then feed those data into the CASE program for additional and very strong confirmation of the structure.
It is noteworthy that the proposed structures #3 and #4 (Fig. 8) were not generated at all. A question arises from the analysis of Fig. 9: because the generated structure #4 (Fig. 9) is preferable to structure #9 (suggested by Wu et al.) and has a geometrical configuration similar to that of the correct structure, is the AFM method capable of distinguishing between structures #1 and #4 shown in Fig. 9?

Conclusions
We have considered two examples of structures that were deemed too difficult to elucidate using traditional methods of 1D and 2D NMR spectra structural interpretation. In both cases, the researchers [12,13] used new, more challenging experimental techniques [15,24,25] to perform small molecule structure elucidation.
Using these examples, we have tried to demonstrate that the application of a CASE approach is a viable alternative to rather sophisticated and laborious methods and, in these cases at least, could solve both problems quickly and reliably. As pointed out by Reynolds and Enriquez, [29] the optimization of the experimental parameters associated with commonly used pulse sequence experiments can provide access to additional data that were initially unavailable using default parameters.
We conclude that a modern CASE expert system should be considered as an integral part of a spectroscopist's armory for quick and reliable structure elucidation. Our research has shown that it is now impossible to evaluate the capabilities of NMR experimental techniques in isolation from mathematical algorithms developed to analyze 2D NMR data and to logically infer all structures consistent with the experimental data and additional information. We believe that in the future, CASE software will become a common tool for NMR spectroscopists, much like the software that is today an integral part of X-ray crystallography. Although this manuscript has demonstrated that CASE approaches can be applied to existing data, we do not wish to discourage the development of new and improved methods for generating new data to improve the success of correctly elucidating molecular structures. In particular, we are extremely encouraged by the work of Gross et al. to produce microscopy images of single molecules and are excited by the future possibilities of such an approach.