A quantitative map of protein sequence space for the cis-defensin superfamily

Motivation: The cis-defensins are a superfamily of small, cationic, cysteine-rich proteins, sharing a common scaffold, but highly divergent sequences and varied functions from host-defence to signalling. Superfamily members are most abundant in plants (with some genomes containing hundreds of members), but are also found across fungi and invertebrates. However, of the thousands of cis-defensin sequences in databases, only have a handful have solved structures or assigned activities. Non-phylogenetic sequence-analysis methods are therefore necessary to use the relationships within the superfamily to classify members, and to predict and engineer functions. Results: We show that the generation of a quantitative map of sequence space allows these highly divergent sequences to be usefully analysed. This information-rich technique can identify natural groupings of sequences with similar biophysical properties, detect interpretable covarying properties, and provide information on typical or intermediate sequences for each cluster. The cis-defensin superfamily contains clearly-defined groups, identifiable based on their biophysical properties and motifs. The organisation of sequences within this space also provides a foundation of understanding the evolution of the superfamily. A exploring and


Introduction Defensins
Defensins are small, cationic, disulphide-rich proteins found in almost all tissue types across animals, plants, fungi (Shafee et al., 2017). They are best known for their hostdefence and antimicrobial activities, but members have also been recruited to a variety of additional specialised functions including ion channel blocking, enzyme inhibition and cell signalling (Zhu et al., 2005;Van der Weerden and Anderson, 2013;Shafee et al., 2017). Furthermore, the same function can be mediated by different mechanisms, even by closely related defensins (Vriens et al., 2014;Bleackley et al., 2016;Gao and Zhu, 2012).
The defensins actually consist of two superfamilies with independent evolutionary origins, the cis-and trans-defensins (Shafee, Lay, et al., 2016). The two superfamilies are defined by their structural similarity, secondary structure order, and disulphide topology (Shafee, Lay, et al., 2016). The most conserved feature of each superfamily is a pair of characteristic disulphides. In the cis-defensin superfamily, this disulphide pair is cisoriented ( Figure 1A), and sometimes called a cysteine-stabilised αβ fold. The transdefensins have a trans-oriented pair of disulphides ( Figure 1B), and include the defensins from humans and other vertebrates (Shafee, Lay, et al., 2016;Shafee et al., 2017). Extensive divergent evolution within each superfamily has elaborated this scaffold to form multiple families, typically defined by disulphide class (cysteine motif and disulphide connectivity), resulting in an array of markedly different overall structures. Adding to this, the term 'defensin' has sometimes also been used as a general functional descriptor of any protein with host-defence activity (Parisi et al., 2018).
The cis-defensin superfamily is most characterised in plants, with thousands of members (sometimes hundreds in a genome) (Silverstein et al., 2005). This number is growing rapidly as more genomes are sequenced (e.g. the 1KP project), and is likely an underestimate, since they are under-annotated by current methods (Silverstein et al., 2007). Additional cis-defensins are also widely distributed across cellular life in animals, fungi and bacteria (Shafee et al., 2017(Shafee et al., , 2018. What are all these proteins doing? Few have been functionally characterised and, although they're often assumed to be antimicrobial, even similar sequences can have very different activities (Vriens et al., 2014;Bleackley et al., 2016;Gao and Zhu, 2012;Zhu et al., 2014). Added to this, the sequences and structures of cis-defensins are extremely divergent (pairwise identities frequently <15% and typical length 35-60aa) due to their robust structures combined with strong selection pressures (Shafee, Robinson, et al., 2016). This diversity poses a problem for traditional sequence analysis methods, and specialised bioinformatic techniques often have to be used to address their limitations (Silverstein et al., 2007;Shafee, Robinson, et al., 2016). For example, phylogenetic analysis becomes unreliable for short, divergent, sequences under strong selection pressures, where similarities may be due to either common ancestry or convergence (homoplasy) (Wake, 1991;Rost, 1999). Sequence analysis has therefore been limited to closely related families (Zhu, 2008;Ma et al., 2012;Takeuchi and Higashiyama, 2012;Cabezas-cruz et al., 2016;Van der Weerden and Anderson, 2013).
Defensins are typically categorised into disulphide classes solely based on their cysteine spacing and connectivity. Although simple, this approach has several drawbacks. Many cysteine patterns contains members with different functions (e.g. the C6 class contains AMPs, toxins, and signalling proteins; The C8 class contains AMPs and sweet tasting proteins (Shafee et al., 2017)). Finally, the relationships between the classes remain illdefined. Consequently, the limited predictive power and biological meaning of the classes misses a global view of the superfamily's sequence-function relationships.
A better analysis approach for such a diverse superfamily is to build a quantitative map of protein sequence space, based on the biophysical properties of each residue in each sequence. In this work, I describe this analysis procedure, and its application to the cisdefensin superfamily.

Protein sequence space
Protein sequence space is a conceptual arrangement all possible protein sequences in a highly multidimensional space (Smith, 1970). At its simplest, the space has one dimension per amino acid in the sequence, to represent all possible combinations of amino acids. Proteins can then be arranged in this space based on their residues, with functional proteins occupying a vanishingly small fraction of the total space (Keefe and Szostak, 2001;Stemmer, 1995). Protein evolution can be thought of as the movement though sequence space as mutations explore new regions of it, with selection retaining only fit variants (Romero and Arnold, 2009). It is often used in a qualitative sense to aid discussion (Harms and Thornton, 2013), but is difficult to directly analyse in a quantitative manner.
Two key difficulties in generating quantitative maps of protein sequence space are a) how to ascribe meaningful, quantitative axes for the dimensions, and b) how to visualise and interpret such a high-dimensional space. Various metrics have been proposed for describing protein sequences numerically, from BLOSUM matrices (Casari et al., 1995), binary descriptors for each possible residue (Wang and Kennedy, 2014), and biophysical properties of residues or whole proteins (Rackovsky, 2009;Du et al., 2006). We favour the use of biophysical properties for each residue, since the resulting positions of sequences within sequence space more informative. The most divergent biophysical properties of the 20 naturally occurring amino acids were used in order to be easily interpretable (Atchley et al., 2005).
The map of cis-defensin sequence space contains clear groups of sequences. Importantly, the main functions of the superfamily (such as antimicrobial, immunogenic, or neurotoxic activities) are clearly separated, enabling identification of the key biophysical properties that determine those functions. Regions of sequence space that are densely or sparsely populated are clearly identifiable, as are sequences with intermediate properties between groups. It is therefore a useful method of classification and analysis for sequence-diverse superfamilies like the defensins.

Results
A set of evolutionarily related cis-defensin sequences were gathered by a combination of structural and sequence similarity searches, as previously published (Shafee, Lay, et al., 2016). Sequences were aligned using methods specialised for CRP sequences, which constrain homologous cysteines (and therefore inter-cysteine regions) based on solved structures (Shafee, Robinson, et al., 2016). The resulting Multiple Sequence Alignment (MSA) contains indels mainly in solvent-exposed loop regions and is consistent with structure alignments and homologous secondary structure (Fig S1).
To generate meaningful sequence space axes, residues must be placed along each axis in a meaningful order. In this work, proteins were placed in a quantitative sequence space using the biophysical properties of each residue in the MSA (charge, hydrophobicity, molecular weight, disorder propensity, disulphide potential, MSA column occupancy; Table S3). In this case, each defensin sequence was represented by a vector of its residues' biophysical properties in a 1176-dimensional sequence space. This highdimensional space was then rotated and projected into a smaller number of dimensions that summarised the main sources of covarying properties by Principal Component Analysis (PCA). This process captured the important features of the sequence space in a human-comprehensible number of dimensions, analogous to a 3D shadow of a megadimensional object. Within the projected sequence space, unsupervised learning was then used to identify naturally-occurring clusters of sequences with similar biophysical properties ( Figure 1C).

Figure 1 | Quantitative map of sequence space for the cis-defensin superfamily.
Secondary structure and disulphide topology that define the (A) cis-defensin and (B) transdefensin superfamilies with conserved disulphides in yellow. (C) Each point represents a sequence. Axes are principal components 1, 2, and 3 of the mega-dimensional sequence space. Sequences are coloured according to the identified Bayesian model-based groups and named as

Key functions reside in well-defined regions of sequence space
The axes of the projected sequence space (Principal Components, or PCs) summarise the most significant covarying sequence properties. The first three PC dimensions of the sequence space separate out three key features of the cis-defensin superfamily ( Fig S2). PC1 separates the plant and animal sequences. PC2 separates the antimicrobial proteins from the neurotoxins. PC3 separates those defensins with signalling functions. The fourth PC further separates a group of histidine-rich defensins ( Fig S3). Further variance is captured by subsequent PCs, however higher PCs do not give much additional separation into clusters ( Fig S4).
The sequences form seven well-separated clusters in the sequence space (Table 1), with voids between them sparsely populated by sequences with intermediate properties.
Groups were automatically assigned by Bayesian model-based clustering to identify naturally occurring groups of proteins. Each group was additionally analysed separately by repeating the PCA and clustering process to identify subclusters ( Fig S5, Fig S6). This produced a broad hierarchy within the sequence space. Phylogenies could be resolved for some subclusters and illustrated how the subcluster had evolved through the sequence space ( Fig S3E). This greatly extends on the standard classification system based on cysteine motifs, since sequences with the same motif are further organised within the clusters in sequence space and clear sets of covarying biophysical properties can be identified. , whilst group 2 is the largest cluster and contains most of the characterised plant defensins (e.g. NaD1). They contain mostly 8-cysteine defensins, but also include the 10-cysteine defensins from petunia (e.g. PhD1). The class II defensins are tightly clustered, whereas class I are distributed through groups 1 and 2 (Figure 2A), in line with their greater variety of mechanisms and efficacies (Payne et al., 2016;Parisi et al., 2018). Additionally, defensins that are known to enter the target cell during their mechanism of action (e.g. NaD1 and MtDef4) tend to be high in the PC1 direction, whereas defensins that remain outside the cell during their mechanism are lower in PC1 direction (e.g. RsAFP1) (Parisi et al., 2018).
Group 3, the intermediate cluster, contains a mixture of plant, animal and fungal proteins and a mixture of functions. When the sequence space is analysed for just the group 3 proteins alone, clear subclusters are evident, such as the C6 defensins, macin defensins, fungal defensins, plant fusion defensins, and various classes of scorpion toxin ( Figure 2C). The defensins with a C6 structures are similarly segregated into toxic and non-toxic subclusters ( Figure 2C). Taxa are clearly organised within group 3, and the non-toxic C6 defensins are the only subcluster to contain a mixture of plant, fungal and animal members. context. (C) Group 3 ('intermediates') isolated and re-analysed. C6 toxins in purple, non-C6 toxins in red, and non-toxic C6 defensins in blue. PDB codes for representative structures indicated where available. Axes are labelled as PC1b, 2b, 3b to emphasise that they are new principal component axes for group 3, distinct from those PCs found the overall sequence space.
Group 4 contains the majority of the plant signalling proteins involved in self/non-self recognition, which mediate sporophytic self-incompatibility during fertilisation. The cluster has a broad boundary with group 2 plant defensins, with the interface populated by sequences with intermediate properties, including the unusually disulphide bonded sex-locus g class (Shafee, Lay, et al., 2016).
Group 5 contains a small cluster of plant defensins that have a significant enrichment of histidines ( Fig S5E). It also contains two subclusters with intermediate numbers of histidines that they appear to have evolved from (Fig S3E). Currently none of these members are characterised, but given their unusual residue composition, it is likely that they will display interesting properties, possibly in addition to antimicrobial activity (Mirouze et al., 2006).
Group 7 contains the scorpion α-toxins and β-toxins. This group is particularly wellseparated from the others, with relatively few sequences in between. It has two main sub-clusters that separate the known α-and β-toxins (Gopalakrishnakone et al., 2015). The sequences that do fall in the relative void between groups 7 and 3 are the excitotoxins, which show similarities to the α-and β-toxins with elongated N-terminal loops (Oren et al., 1998).

Clusters indicate viable combinations of biophysical properties
The axes of the quantitative sequence space map are determined by covarying sets of residue biophysical properties (Table S1). The sequence and biophysical properties that most strongly determine a sequence's position in the projected sequence space are summarised by the PC loadings that define each axis.
Because the sequence space is built from an MSA, the data can be mapped onto known tertiary structures to give some spatiochemical information ( Figure 3). For example, loadings indicate that antimicrobial activity is strongly influenced by having a hydrophobic within loop 5 with charge at either end (Table S1, MSA columns 108-149), a region involved in lipid binding in plant defensins (Poon et al., 2014;Payne et al., 2016). These covarying residue property sets correlate well with regions of biological importance. Variation in this region is also correlated with features on loop 1b and 4a the opposite face of the structure (MSA columns 40-42 and 77-80, Figure 3A), and extend previous work describing roles for these loops in overall antimicrobial function (Bleackley et al., 2016).
Disulphide potential and residue occupancy were amongst the most highly loaded properties for the first PC axes (Table S1), in agreement with the known disulphide classes. The additional biophysical properties are necessary to identify well-defined groups, since repeating the analysis using only occupancy and disulphide potential is insufficient to find clear clusters (Fig S4C,D). Similarly, properties that are highly conserved across the whole superfamily (such as the four cysteines that define the cisdefensins) have low loading for the axes (Fig S8). The most useful information is contributed by residues that are not too constrained (e.g. the fully conserved cysteines) nor too unconstrained (e.g. positions where any residue is viable) (Fig S8).

Structural classes are well ordered in sequence space
Structure is well known to be more evolutionarily conserved than sequence and is therefore extremely informative for long evolutionary timescales (Shafee, Lay, et al., 2016;Undheim et al., 2016;Orengo and Thornton, 2005). Although the cis-defensin superfamily has a conserved overall fold, there are a diverse range of variations on the structure (including changes in disulphide number, loop length, and secondary structure elements). The sequence space analysis was performed using only a sequence alignment as an input, i.e. without using structural information (except for guiding the MSA) (Shafee, Robinson, et al., 2016). To test whether these similar structures cluster together into particular regions of sequence space or are distributed throughout, proteins with known structures were hierarchically clustered by their structural similarity. The resulting structural similarity dendrogram was threaded through the sequence space.
Defensins with similar biophysical properties also shared more similar structures ( Figure  3E,F), even for proteins of the same cysteine class (e.g. C8 defensins). Within proteins of the same cysteine motif, there are therefore distinct identifiable subclusters with shared biophysical and structural properties. Therefore disulphide classes, and groups of similar structures within classes, are inhabit distinct regions of sequence space. Although each group shares many biophysical properties, intermediate forms do exist between the more populated clusters.

Comparisons with other sequence and structure analyses
A maximum likelihood phylogeny using the MSA gave low bootstrap values, with an average bootstrap of 0.22 across all nodes of the tree (Fig S9a), and with deeper branches showing negligible bootstrap support ( Fig S9b). These bootstrap values are in line with previously published defensin phylogenies (Van der Weerden and Anderson, 2013; Zhu et al., 2005). Collapsing the tree at nodes with bootstrap >0.2 produced over 100 separate families (median 9 members). This stems from well-established limitations of phylogenetics for short, divergent, strongly-selected sequences (Shafee, Robinson, et al., 2016). Indeed, for the seven largest phylogenetic clades, only 43% sequences fall into the same clade in each bootstrap repeat ( Fig S10A).
Conversely, in the sequence space map of each of the alternative MSAs, 86% of sequences remain in the same cluster in each bootstrap repeat ( Fig S10B). Variation in the positions of sequences within the space upon bootstrapping is small enough that cluster prediction is perturbed for only a few sequences near the edges of clusters ( Fig  4A). This is in line with related multivariate analysis methods, which can be more accurate for more divergent sequences, robust to changes in dataset, and not as perturbed by intermediate or hybrid sequences (Higgins, 1992;Wallace and Higgins, 2007).
Both phylogenetics and the methods presented here rely on an accurate MSA. A set of 100 alternative MSAs using the same sequences were generated to test how variation on the starting MSA perturbs downstream analyses. The average bootstrap across all nodes of the phylogeny fell to 0.11, and the sequences placed into the same major clades fell from 43% to 32% (Fig S10D). The sequence space method proved more robust to alterations in the MSA fall from 86% to 83% (Fig S10C). This is largely because the properties that dominate the separations in sequence space are also those that are most consistently aligned in the alternative MSA.
Sequence similarity networks (SSNs) represent an alternative method that does not require an MSA and is often informative for diverse protein superfamilies (Atkinson et al., 2009;Cheng et al., 2014). However SSNs struggle to organise defensins due to their low sequence similarities, causing the network to fracture into smaller sub-networks, even at a permissive expect value of 10 -4 (Shafee, Lay, et al., 2016). These sub-networks agree well with the clustering found in the sequence space analysis (Fig S11), allowing them to be arranged relative to one another. However, the sequence space analysis also gives information on the biophysical determinants of the identified sequence groups. Finally, because the layout is deterministic, rather than stochastic, bootstrap and jackknife replicates can be used to check the sensitivity of the sequence locations or number of clusters to the dataset used ( Figure 4). Defensin sequence space can be explored using a webtool To facilitate interactive viewing and query of the sequence space, a simple webtool is available at TS404.shinyapps.io/DefSpace (source code at github.com/TS404/DefSpace). This tool can interactively display the sequence space, and calculate the locations of query sequences (Fig S12). It also identifies whether a sequence is a cis-or transdefensin and which cysteine motifs the sequence contains (if any). Cysteine motifs are based on regular expressions that encompass 90% of the variation in known sequences (Table S2). This allows defensin sequences of interest to be quickly checked for their position in order to give some expectation of their function.

Discussion
Defensins are found in most transcriptomes and genomes, yet predicting their function has remained elusive due to their highly diverse sequences and activities.
The broader sequence-function relationships and evolutionary history of the cisdefensins are only beginning to be understood. For example, mutagenesis studies have identified some of the sequence determinants that differentiate antimicrobial from neurotoxic activity in arthropod cis-defensins (Zhu et al., 2014). Conversely, conservation of antiplasmodal activity in a hypothetical ancestral defensin from the base of the chelicerates was demonstrated by ancestral sequence reconstruction (Cabezascruz et al., 2016). This work allows for more ancient evolutionarily events in the superfamily's history to be analysed.

Implications for early defensin evolution
The ancient evolution of the defensins is highly unclear. The recent separation into two separate superfamilies indicates independent origins for the cis-defensins and transdefensins (Shafee, Lay, et al., 2016). However, the early evolution within each superfamily remains obscured by their extreme sequence divergence. Some additional insight can be gained from the organisation of functions within the sequence space, which captures the limits of the viable sequence diversity. As mutations have generated sequences in new regions of sequence space, selection has constrained them to clustered regions where they retain or gain useful functions.
The superfamily displays great sequence diversity and many clusters have proteins between them with intermediate properties. Nevertheless, there are clearly preferred regions of sequence space with several clear voids indicating unfavorable sequence property combinations. At the same time, some activities are achieved by proteins in distinct regions of sequence space, highlighting how the defensins have evolved multiple viable ways of achieving similar functions. It is also consistent with several independent recruitments of the fold to toxic function.
The analysis also gives clues to the origins of the superfamily. The C6 defensins are clearly more similar to one another than they are to other defensins, even when present across multiple taxa (Shafee et al., 2017). This supports a scenario where the C6 defensin scaffold is more ancient than the other disulphide classes, which represent elaborations of this more ancient fold ( Figure 5). It is also in line with the possibility of a yet more ancient ancestral structure with only two disulphides (Shafee et al., 2018). The clustering of the C8 defensin fold is consistent with divergence from the C6 defensins before the separation of plants and animals, as opposed to multiple origins of the C8 fold. Flattened schematics of (A) the full superfamily's sequence space, coloured as in Figure 1, and (B) just the intermediate group, coloured as in Figure 2. Structures shown where available. The most ancient scaffolds are likely the C6 defensins involved in host defence and antimicrobial activity, followed by the C8 defensins. Subsequent elaborations of those scaffolds gave rise to the variety of folds and functions now present in the superfamily.

Prediction of function from sequence
Defensin sequences are not randomly distributed on sequence space, despite their extreme sequence diversity. Instead, the sequence space contains well-defined clusters of sequences with similar biophysical properties, separated by sparsely-populated voids. However, there remain plenty of sequences with intermediate properties. Groups 1 and 2 are the exception to this, with a continuous transition of intermediates at their interface, in line with their similar microbial functions (though via varied mechanisms) (Parisi et al., 2018). The number of covarying residues indicate a high degree of cooperation in determining defensin function.
The identified groups correlate well with currently known functions within the superfamily. There are clear separations between plant and animal defensins, as well as antimicrobial, signalling, and toxic functions. The PC axes that separate these groups also describe the key biophysical properties that characterise the different functions. This information will be useful in identifying likely biological roles of sequences, as well as which sequence elements determine that function, and whether it can be further engineered by mutagenesis. Newly identified sequences of interest can be added to the sequence space to identify their relative position, identify cluster centrality, and find nearby neighbours with known functions, mechanisms or structures.
Plant, animal, and fungal sequences are largely segregated, and mostly form clearly identifiable clusters. Even the taxonomically diverse C6 class is organised within group 3, with the long n-loop C6 defensins further clustered in group 6. Key sequence properties of the C6 subclusters have therefore been conserved since the plant-fungal-animal split, despite extensive sequence change.
The sequence space organisation lays the groundwork for improving our understanding of how divergent evolution gave rise to the superfamily's functional diversity. How have the different antimicrobial mechanisms evolved? Is the ability of antimicrobial plant defensins to traverse cell walls and enter the nucleus related to their role in development, possibly by acting as transcription factors or inducing apoptosis? What is the function of the well-defined cluster of uncharacterised histidine-rich defensins? What activities are exhibited by those defensins that lie in the sparsely populated intercluster regions?

Benefits and limitations
The extreme sequence diversity of the cis-defensin superfamily affords both opportunities and difficulties. It provides excellent library of functional peptides, with a scaffold compatible with a range of different activities. Conversely, the same diversity limits the application of standard sequence analysis methods and protein engineering. Although any displayed sequence space is necessarily a simplification of the full megadimensional sequence space, the simplified projections presented here retain a great deal of useful information, and are relatively intuitive to understand.
This quantitative sequence space map of the cis-defensin superfamily provides a foundation for classifying defensins and understanding their sequence-function relationships. Biologically relevant clusters are clearly identifiable within the sequence space, as well as sequences with intermediate properties. Sequence space axes report on key properties that separate proteins with a layout is deterministic rather than stochastic.
Since an MSA is used, proteins analysed by this method must be related to one another, such that columns of the MSA contain homologous residues (unlike e.g. similarity networks where only pairwise alignments are needed). Proteins not derived from a common ancestor do not have homologous residues aligned and so cannot be compared. However, it may be cautiously extendable to sequences for which analogous residues can be confidently compared.
Confidently resolving the "twilight zone" evolutionary relationships of highly divergent sequences is currently beyond the scope of either phylogenetic, or non-phylogenetic sequence analysis methods (Rost, 1999;Atkinson et al., 2009;Inkpen and Doolittle, 2016;Pearson and Sierk, 2005). This method therefore forgoes an explicit evolutionary model, and so does not distinguish whether changes in sequences' biophysical properties are conserved from an ancestral state or converged upon from different states (Jackson et al., 2018). Convergence of multiple biophysical properties would generate similar homoplasy to that which can occur in phylogenies. Particularly in the case of small CRPs, like the defensins, the strong selective pressures and high evolvability due to the scaffold's structural robustness make it likely that some covarying property sets have been arrived upon in separate lineages.

Conclusions
Quantitatively mapping the sequence space of the cis-defensins provides a useful and informative classification system. It identifies key biophysical properties that separate biological functions, and hierarchically clusters groups of sequences by their biophysical properties. Biological kingdoms and annotated functions are well organised within their own regions of the sequence space. The cluster-centrality of a sequence is clearly identifiable, as are intermediates between the main clusters.
This sequence analysis technique may also be broadly applicable to other cysteine-rich protein superfamilies, whose sequences are similarly too diverse for standard classification techniques. When phylogenetics is applicable, sequence space maps provide complementary information on how the biophysical properties of a protein have evolved.
The accompanying webtool facilitates easy identification of sequences of interest within the sequence space map and provides additionally information on superfamily and known motifs.

Sequence gathering
Homologous cis-defensin sequences were gathered as per reference (Shafee, Lay, et al., 2016). Briefly, cis-defensin structures were gathered using DALI (Holm and Rosenström, 2010), with the plant defensin NaD1 (PDB:1MR4) as the initial query. Unique proteins whose structures had Z-scores >2 were collected and used as queries in turn until no new structures were identified. These sequences were used as query sequences for BLASTp searches against the non-redundant protein database and Sol genomics network database (E-value cutoff <0.005) (Mueller et al., 2005). Additional rounds of sequence gathering were performed by using the least redundant sequences as queries for subsequent rounds of BLASTp until no new sequences were identified. This yielded 2019 total sequences from 352 species.

Sequence alignment
Because defensin sequences have high sequence length variation, sequence alignments are more accurate when structurally homologous cysteines are constrained, allowing the homologous inter-cysteine loops to align (Shafee, Robinson, et al., 2016). Briefly, multiple sequence alignments were generated using the CysBar webserver (Shafee, Robinson, et al., 2016) and ClustalΩ (Sievers et al., 2011). Homologous cysteines were barcoded to constrain the alignment and allow the correct inter-cysteine loops to align. Overall protein structure was not used to otherwise inform the sequence alignment. Alternative alignments were compared with AlignStat (Shafee and Cooke, 2016). The final MSA (supplementary data file 1) was used for subsequent analyses (Fig S13). The MSA showed good alignment of homologous secondary structure elements, with indels mainly occurring in solvent-exposed loop regions, and with agreement with structure alignments. To test the dependence of the analyses on the MSA, a set of 100 alternative MSAs of the same sequences were also generated using Guidance2 as described later in the section.

Structure similarity network
Structure similarity was calculated as per reference (Shafee, Lay, et al., 2016). Briefly, for cis-defensins with known structure, pairwise structure alignment using combinatorial extension was performed with the proCKSI webserver to generate an RMSD distance matrix. This matrix was hierarchically clustered by neighbour joining of the most similar structures to generate a bifurcating dendrogram which was then mapped onto the sequence space coordinates in [R]. Disulphide classes were annotated based on cysteine motif and connectivity (Shafee et al., 2017).

Homology model structure
Published protein structures are used with the exception for NbD2 (no structures have yet been solved for any his-rich group 5 protein). For illustrations, a simple homology model was made of NbD2 using SWISS-model, with PDB:3PSM selected as the best template structure (Arnold et al., 2006).

Phylogeny
A maximum likelihood phylogeny with 1000 bootstraps was generated based on the MSA using RaxML (Stamatakis, 2014). The optimal substitution model was identified using ProtTest as Whelan and Goldman model with a gamma distribution (Darriba et al., 2011;Whelan and Goldman, 2001). A strictly trimmed alignment (using trimAl (Capella-Gutiérrez et al., 2009)) did not give notably different results from the full alignment. Phylogenies were annotated using iTOL (Letunic and Bork, 2016).

Numericisation
To quantitatively place sequences in a multidimensional sequence space, each residue of each sequence in the MSA was described by its biophysical properties. The variables used were R-group molecular weight (Daltons), net charge (Coulombs), hydrophobicity (Doolittle index) (Kyte and Doolittle, 1982), disorder propensity (TOP-IDP) (Campen et al., 2008), disulphide potential (binary descriptor), and occupancy (binary descriptor). These properties encompass the main differences between the naturally occurring amino acids. Disulphide potential is included in this case since disulphides are particularly important to defensin structures. MSA column occupancy accounts for different sequence lengths. See supplementary table S3 for the values used. Note that this is also compatible with non-natural amino acids, or additional biophysical properties of interest.
The MSA with 2019 sequences and 196 columns was therefore converted into a numerical matrix with 2019 rows and 1176 columns (196 × n where n is the number of biophysical properties used, in this case, n=6 leading to 196 × 6 = 1176 columns). The resulting matrix of numerical values was used to represent the raw protein sequence space before multidimensional scaling.
Values were normalised within each property. Gaps were given the average value of their column for each property (other than occupancy) such that they had no effect on subsequent multidimensional scaling.

Multidimensional scaling
The highly multidimensional, numericised protein sequence space was analysed using Principal Components Analysis (PCA) to summarise the main covarying sets of properties (using [R] prcomp) (Dev. Core Team, 2011). The resulting principal components describe covarying sets of residue properties.

Bayesian clustering
Bayesian clustering was performed using the Gaussian finite mixture method (using [R] mclust) (Fraley and Raftery, 2012). The first 40 PCs were used for clustering since they summarised the most important 30% of the information contained in the 1176 dimensions.
Briefly, this algorithm calculates the models the distribution of data points as a set of spheroid clusters with varied sizes, elongation and orientation. The optimal number of clusters is chosen based on goodness of fit (Bayesian Information Criterion). Adding clusters to the model improves the models fit to the data until an optimal number of clusters is reached, after which additional clusters fail to improve the model's fit.

Repeatability metrics
Bootstraps of the sequence space were generated by iteratively ignoring 10% of columns in the same MSA and generating the sequence space each time. Variation in the coordinates in the sequence space were reported as well as the optimal number of Bayesian clusters to summarise the data. Similarly jack-knife replicates were performed by randomly ignoring 10% of sequences in the MSA.
To further assess the dependence of the methods on the input MSA, 100 alternative MSAs were generated using Guidance2, which generates plausible variant alignments of indels (Sela et al., 2015). These MSAs were used to generate maximum likelihood phylogenies and sequence space maps as described below.
To assess phylogenetic dependence on the input MSA, each of the 100 alternative MSAs was used to generate a maximum likelihood phylogeny using the same model parameters as for the original tree. For each of the 100 trees, the 7 largest clades were identified by ward clustering in [R] and compared for repeatability.
To assess SeqSpace dependence on the input MSA, the 100 were each numericised, scaled and clustered as described above. The Bayesian clusters found in each repeat were then compared for repeatability in [R].
Sequence similarity networks do not depend on an MSA (rather they typically use all-vsall pairwise alignments) so a comparable measure was not possible in this regard.
33 Figure S7 | Subclusters of group 2 separate C8 defensins from different taxa. Subclusters of group 2 when the group's sequence space is calculated separately, with plant C8 defensins coloured green, mollusc C8 defensins coloured yellow, and insect C8 defensins coloured purple.   In addition to some basic background information (not shown) two tabs exist. (A) The view tab opens an interactive diagram of the sequence space. (B) The find tab indicates the location of a query sequence within the sequence space. If the user is unsure of the whether the sequence is from the cis-or trans-defensin superfamily, the program will attempt to assign it based on cysteine motifs and sequence similarities. In both cases, activating 'selection mode' allows the user to select one or more sequences to view them in a multiple sequence alignment.

40
Figure S13 | Flow chart of sequence space analysis workflow. See methods section for details of individual processes. Rectangles indicate processes, parallelograms indicate inputs/outputs. Upward triangles indicate data splitting, downward triangles indicate data merging.