Fast Whole-Genome Phylogeny by Compression: the COVID-19 case

—We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, causing the COVID-19 disease, using compression in the form of the alignment-free NCD (Normalized Compression Distance) method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6,500 viruses. The results comprise that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6,500 viruses are identiﬁed (given by their registration code) with larger NCD’s. The NCD’s are compared with the NCD’s between the mtDNA’s of familiar species. We treat the question whether Pangolins are involved in the SARS-CoV-2 virus. The NCD method or shortly the compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here we use it for the complex case of determining this similarity between the COVID-19 virus SARS-CoV-2 and many other viruses. The resulting phylogeny and taxonomy closely matches earlier efforts by alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method showing that one can achieve equivalent results both simply and fast.


I. INTRODUCTION
In the 2019, 2020, and 2021 pandemic of the COVID-19 disease caused by the SARS-CoV-2 virus, studies of the phylogeny and taxonomy of this virus use essentially two methods: alignment-based analysis, for example [21], and alignment-free machine-learning [25].
In alignment-based methods one customarily takes two or more sequences over a bounded alphabet of letters and makes them identical by changing letters, introducing gaps and so on, where each of these operations carries a cost.The total cost is computed by adding these sub costs.The alignment with the least total cost is preferable.Problems are amongst others that similar regions on sequences being compared may be far apart causing massive alignment costs and that the computation cost of an alignment may be too high or forbidding.Alignment methods have many more drawbacks which make alignment-free methods more attractive [32], [30], [34].
Machine learning algorithms learn a concept using an input consisting of many examples of that concept.It is alignment-free but uses complicated algorithms with many parameters to set.This was used in the phylogeny analysis of the COVID-19 virus in [25] together with decision trees and other devices.
The purpose of this study is to identify the closest viruses to the SARS-CoV-2 virus isolate and to determine their structural relations (taxonomy) in phylogeny trees 1 and 2 and the Normalized Compression Distances (NCD's) in Table I using compression.We compare this method with the other methods in terms of results and speed.The method is alignment-free, based on lossless compression of the whole-genome sequences of base pairs of the involved viruses, and is called the normalized compression distance (NCD) method or RC is a research associate with the Centre for Mathematics and Computer Science (CWI).Address: CWI, Science Park 123, 1098XG Amsterdam, The Netherlands.Email: cilibrar@gmail.comPV is with the CWI and the University of Amsterdam, Fac.FNWI, Science Park 904, 1098 XH Amsterdam, The Netherlands.He is the corresponding author.Address: CWI, Science Park 123, 1098XG Amsterdam, The Netherlands.Email: Paul.Vitanyi@cwi.nlshortly the compression method since it computes the NCD's between pairs of genomes.
We computed the NCD's between 6,751 viruses and a selected SARS-CoV-2 virus but we can only visualize part of them.With this method we quantify the proximity relations between pairs of viruses by their NCD's and compare them to similar relations between the mtDNA's of familiar mammal species in order to gain an intuition as to what they mean.
Phylogeny analysis is widely used but there are challenges of, amongst others, accuracy and speed.The NCD as a measure of similarity seems a good idea to solve these challenges.The compression method determines the similarity of two sequences using a single formula once, and thus does not use many examples.We apply it here to the complex problem of SARS-CoV-2 phylogeny and verify whether the results using this method are comparable with those from the other used methods.The conclusion is that the NCD method is an alignment-free method that gives accurate phylogeny results, while being possibly faster than both alignment-based methods and previously used alignment-free methods.
Earlier studies pointed to the origin of the SARS-CoV-2 virus as being from Bats.It is thought to belong to lineage B (Sarbecovirus) of Betacoronavirus.From phylogenetic analysis and genome organization it was identified as a SARS-like coronavirus, and to have the highest similarity to the SARS bat coronavirus RaTG13 and being similar to the two bat SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45, see for example [21], [25].
We computed the NCD's of 6,751+15,409=22,160 virus pairs plus the NCD's used for Figures 1 and 2. Altogether, this took about 5-10 hours in a combined run on a home desktop computer (a minicomputer called Meerkat from a Linux computer company called System76).It uses less than 2 seconds per pair which comes to more than 2000 pairs per hour.The same program can be re-used for different phylogenies and questions.This is possibly the easiest and fastest method to establish whole-genome phylogeny.
Figures 1 and 2 were done with the PHYLIP (PHYLogeny Inference Package) [33].The programming to obtain the NCD matrices underlying the phylogenetic trees in the figures is easy; essentially compute the NCD's for all pairs of viruses to be compared.The comparisons between the NCD's of viruses and the NCD's of the mtDNA's of species are made on the basis of the Table in Figure 9 of [5].

Author Affiliations
RC is a research associate with the Centre for Mathematics and Computer Science (CWI), Science Park 123, 1098 XG Amsterdam, The Netherlands.PV is with CWI and the University of Amsterdam, Fac.FNWI, Science Park 904, 1098 XH Amsterdam, The Netherlands.

II. METHOD
A string is a sequence of finite length over a finite alphabet, here A,C,G,T.To assess the similarity of two strings we use a simple formula using the compressed versions of both strings and of the concatenation of them to compute the Normalized Compression Distance (NCD) between them.This distance is a quantity in between 0 (identical) and 1 (totally different) expressing similarity.The compression is by a lossless compressor ('lossless' means that the original can be reconstructed from the compressed version, that is, nothing of the compressed string gets lost).The distance was proposed in [16], [5] and especially the last reference where the name NCD was coined.
In [3], [11], [4] the compression method is validated and shown to work better for better compressors.We used zpaq as compression algorithm since [13] "bench marked 430 settings of 48 representative compressors (including 29 specialized sequence compressors and 19 general-purpose compressors) on 27 representative FASTA-formatted test data sets including individual genomes of various sizes, DNA and RNA data sets, and standard protein data sets." Figure 1.A of the cited paper compares the compressors with respect to their compression ratio: zpaq is the best of the considered general-purpose compressors and competitive with the best considered specialized sequence compressors.

HOW WE USED THE METHOD
Using the compression method to compute the phylogeny of the SARS-CoV-2 virus, we first obtained a list of over 6,500 viruses we wanted to compare the SARS-CoV-2 virus to, without duplicates, partially sequenced viruses, and SARS-CoV-2 viruses (except for a single one it turned out).A particular SARS-CoV-2 virus isolate served as standard against which to determine the NCD's of the other viruses.We selected this SARS-CoV-2 virus isolate from the many, at least 15,500, examples available on 17 july 2020 in the GISAID database.We computed for each virus in the list of over 6,500 viruses its NCD with the selected SARS-CoV-2 virus isolate.As compression program we used zpaq.Subsequently we ordered the resulting NCD's from the smallest to the largest.The virus causing the smallest NCD distance with the SARS-CoV-2 virus isolate was regarded as the most similar to that virus.We took the 60 viruses which had the least NCD's with the selected SARS-CoV-2 virus isolate and computed the phylogeny of those viruses and the SARS-CoV-2 virus isolate.Next we compared the last mentioned virus with 37 close viruses including Pangolin ones, to determine the relation of the SARS-CoV-2 virus with the Pangolin viruses.

III. MATERIALS
We downloaded the data in two parts and stored them on the public repository GitHub.The two data sets were as follows.One data set was obtained from the authors of the machine-learning-approach study [25] and consisted of the 6,751 virus sequences used in that paper.There was one SARS-CoV-2 virus among them.
The second data set of 66,899 SARS-CoV-2 virus sequences was downloaded from the hCoV19 directory of GISAID [10] on July 17th, 2020.After data cleaning all 21 virus sequences with /2017, /2018, or /2019 in the name, presumably wrongly classified, were not included.This last set contains three Pangolin viruses with NCD's against the selected SARS-CoV-2 virus of respectively 0.738175, 0874027, and 0.873367.All of the GISAID data with /2020 in the directory are SARS-CoV-2 viruses.Because one of them had the registration code "hcov-19," identical with the name of the directory itself, we reduced the data set by removing that sequence: gisaid hcov-19 2020 07 17 22.fasta.faihCoV-19 29868 1713295574 80 82.Details of the data cleaning are deferred to the Supplemental Material.
We retained altogether 66,897+6,751=73,648 raw sequences that were considered in our effort.The GISAID download reduced after data cleaning to a set of unique 15,578 viral sequences with the known nucleotides A,C,G, and T. This was further reduced to 15,409 sequences.Each viral sequence is an RNA sequence of around 30,000 base pairs.The total size of all sequence data together is in the order of two gigabyte.The 6,751 unique sequences obtained from the authors of [25] were over the letters A,C,G, and T already.

SELECTION OF A SARS-COV-2 VIRUS ISOLATE AS STANDARD
FOR THE NCD'S As a representative of the SARS-CoV-2 viruses we selected the most common one in the data set as the basis for the NCD against the 6,751 non-SARS-CoV-2 viruses.It occurred in the sorted virus list 105 times.In Table I it  The worst-case NCD across all the remaining sequences of the SARS-CoV-2 virus to the selected SARS-CoV-2 virus is 0.044986 and the average NCD is 0.009879.The worst-case sequence [22] can be found at gisaid hcov-19 2020 07 17 22.fasta¡ hCoV-19/USA/OH-OSUP0019/2020-EPI ISL 427291-2020-03-31 with registration code EPI ISL 427291.This shows that the 15,409 SARS-CoV-2 viruses retrieved from GISAID on July 17 in 2020 with /2020 in the name all have sequences that hardly differ from one another.
The results presented in this paper are so much larger than this worst-case NCD that they do not change under subtracting/adding the worst-case NCD to the selected SARS-CoV-2 virus amongst all SARS-CoV-2 viruses.Since the NCD is a metric and thus satisfies the triangle property [5, Theorem 6.2], the main results will not be unduly upset by a different choice of selected SARS-CoV-2 virus since the distances involved vary only by at most ±0.044986.Let us illustrate this.Let x, y, z be objects in a metric and let us denote the distance between x and y by d(x, y), the distance between x and z by d(x, z), and that between y and z by d(y, z).Then |d(x, z) − d(y, z)| ≤ d(x, y).If y is the selected SARS-CoV-2 virus, x is an arbitrary SARS-CoV-2 virus, and z is one of the non-SARS-CoV-2 viruses, then |d(x, y)| ≤ 0.044986.Thus, the results presented are unaffected by the choice of selected SARS-CoV-2 virus.

SORTED NCD'S
In Table I the 60 least NCD's are displayed of all NCD's between the selected SARS-CoV-2 virus and the 6,751 non-SARS-CoV-2 viruses sorted from small to large.The selected SARS-CoV-2 virus appears on the top of the list with NCD toward itself of 0.00362117.It should be 0 as explained before and is due to a small error margin in the computation.
The next virus is the only SARS-CoV-2 virus in the database of 6,751 virus sequences, but not the selected one, with NCD=0.0111034.This gives confirmation that the NCD's in the list are accurately calculated since the number is so very small.The code is MN908947.apart from the above SARS-CoV-2 virus.It is Sarbecovirus/EPI ISL 402131.fasta/¡BetaCoV/bat/Yunnan/RaTG13/2013-EPI ISL 402131.The first part is the classification of the subfamily of viruses, the code EPI ISL 402131 is that of the virus itself which one can use with Google to obtain further information.The virus isolate is sampled from the Rhinolophus affinis, a medium-size Asian bat of the Yunnan Province, PRC China, in 2013.It is 29855 bp RNA linear VRL 24-MAR-2020, and its final registration code is MN996532.The human coronavirus genome shares at least 96.2% of its identity with its bat relative, while its similarity rate with the human strain of the SARS virus is much lower, only 80.3% [7].The NCD distance between the selected SARS-CoV-2 virus and this virus is about the same as that between the mtDNA's of the Chimpansee and the PigmyChimpansee according to the Table in Figure 9 of [5] The first of these three viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZC45, complete genome at 29802 bp RNA linear VRL 05-FEB-2020 of the previous family of viruses.Its NCD with the selected SARS-CoV-2 virus is slightly higher than the mtDNA NCD between the Human and the Gorilla at 0.737 and slightly lower than the mtDNA NCD between the Human and the Orangutan at 0.834.
The second of these three viruses is the Bat SARS-like coronavirus isolate bat-SL-CoVZXC21, complete genome at 29732 bp RNA linear VRL 05-FEB-2020 of the same family.The same comparison as before of the NCD between this virus and the selected SARS-CoV-2 virus with the NCD's of mtDNA's between species holds also for this second virus.
The third virus of the three is the Bat SARS-like coronavirus Rs3367, complete genome at 29792 bp again of the same family.Comparing the NCD between this virus and the selected SARS-CoV-2 virus yields that it is slightly lower than the NCD between the Human mtDNA and the Blue Whale mtDNA at 0.920 and slightly higher than the NCD between the mtDNA of the Finback Whale and the mtDNA of the Brown Bear at 0.915.
In the sense of the NCD therefore the SARS-CoV-2 virus is likely from the family Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes; Nidovirales; Cornidovirineae; Coronaviridae; Orthocoronavirinae; Betacoronavirus; Sarbecovirus.Figure 1 shows the phylogeny directed binary tree built from the NCD matrix in the Supplemental Material of the 60 virus sequences in Table I.
The S(T ) value in Figure 1 tells how well the tree represents the distance matrix concerned [5].To clarify, let there be n objects.They have n × n NCD's.Hence they exist in n-dimensional space.Mapping that space onto two dimensions gives distortions of the distances involved.A flat map representing the earth sphere gives such problems of distortion of distances.A tree can represent the n × n distances between n objects easier than a more demanding 2-dimensional map.The S(T ) value tells how well these distances are preserved (0 is not at all and 1 is perfect).
In Figure 1 all sequences are labeled as they occur in the data of [25] together with their registration code.The most interesting are the 11th to 15th sequences of the tree from the top of the page.The 13th is EPI ISL 402131 which is the bat/Yunnan/RaTG13/2013, that is, the RaTG13 Bat Coronavirus sampled in Yunnan in 2013.The 14th label is the selected SARS-CoV-2 virus which occurs 105 times in the virus database of GISAID, that is, EPI ISL 428253.The 15th label is MN908947 which is the SARS-CoV-2 virus Wuhan Hu-1 from the Wuhan Seafood market collected Dec. 2019, submitted 05-JAN-2020, and reported in Nature, 579(7798), 265-269, 270-273 (2020).This is the only SARS-CoV-2 virus in the 6,751 sequences obtained from the authors of [25].The 11th and 12th labels are the CoVZC45 Bat Coronavirus and the CoVZXC21 Bat Coronavirus.Numbers 11, 12, 13, 14, and 15 have the least NCD distances to the selected SARS-CoV-2 virus.The entire 60 × 60 NCD matrix underlying this tree is too large to display here but is supplied in the Supplemental Material.

VIRUSES CLOSE TO THE SELECTED SARS-COV-2 VIRUS
Figure 2 contains the selected SARS-CoV-2 sequence and all the GISAID sequences ending in /2017, /2018, and /2019 which includes the three Pangolin viruses in the second paragraph of the Section III on Materials and the Data Cleaning section of the Supplemental Material.Added are a dozen close sequences from all GISAID sequences and a dozen close sequences from the machine learning approach study [25] data.The ladders in the directed binary tree are usually there to accommodate more than two outgoing branches from The figure therefore gives a visual representation of the structural relations close to the SARS-CoV-2 virus.We use human mtDNA as outgroup; it is about the same size as the viruses and it can be assumed to be completely different.Using an unrelated sequence which is called the "outgroup" is a common method to determine where the root of a phylogenetic tree is: where it joins the tree there is the root.Since S(T ) = 0.998948 the tree almost perfectly represents the NCD distance matrix concerned in the Supplemental Material.a single node.The entire 37 × 37 NCD matrix is too large to display here but is supplied in the Supplemental Material.

V. THE PANGOLIN CONNECTION
As we saw in the section on Data Cleaning in the Supplemental Material, the GISAID hCoV19 database is littered with Pangolin viruses (now called Pangolin-CoV) from 2017, 2018, and 2019.Several studies e.g.[17], [35] hold that while the SARS-CoV-2 virus probably originates from bats it may have been transmitted to another animal and/or recombined with a virus there and transmitted zoonotic to humans.The other animal is most often identified as the Pangolin.The compression method shows that the NCD's between the Pangolin-related virus and the human SARS-CoV-2 virus are far apart.However, they are not farther apart than 0.738175 to 0.874027.This is between the NCD distance of the mtDNA of a Human to a Gorilla up to the mtDNA of a GreySeal to a BlueWhale.The bat-SL-CoVZXC21 and bat-SL-CoVZC45 viruses are at NCD=0.791082 and NCD=0.788416.That is in between the mtDNA NCD distance of a Human versus a Gorilla at 0.737 and the mtDNA distance of a Human versus an Orangutan at 0.834.Thus, according to the NCD results, the Pangolin and Bat origins are possible for the SARS-CoV-2 virus while the Bat origin is more likely and modification by the Pangolin Fig. 2. The phylogenetic directed binary tree built from 37 virus sequences with the human mtDNA as outgroup to determine the root. is a possibility.But the RaTG13 virus has an NCD=0.444846distance with the selected SARS-CoV-2 virus which is close to one half of the Pangolin NCD's given here.As noted before, this is comparable with the NCD between the Chimpansee and the PigmyChimpansee according to the Table in Figure 9 of [5].Also in the tree of Figure 2 the Pangolin viruses are generally far from the selected SARS-CoV-2 virus.Hence the hypothesis that the Pangolin species is the origin of the SARS-CoV-2 virus, or an intermediary between the Bat and the Human, is perhaps unlikely if we follow these NCD results.

VI. DISCUSSION
We determined the phylogeny and taxonomy of the SARS-CoV-2 virus.Earlier studies using alignment-based methods have suggested that the SARS-CoV-2 virus originated from bats.Bats are a known reservoir of viruses that can zoonotic transmit to humans [19].The machine-learning approach, an alignment-free method [25], came to the same conclusion.The current, completely different, approach agrees with this and has as runner-up the Pangolin.
To compare the compression method with the alignment-free method based on machine learning [25] we used the latter database of over 6,500 unique viral sequences.To select the particular SARS-CoV-2 virus against which these viral sequences are compared we used a database of about 15,500 unique SARS-CoV-2 viruses.The NCD analysis places the RaTG13 bat virus closest to the SARS-CoV-2 virus followed by the two SARS-like corona viruses bat-SL-CoVZXC21 and bat-SL-CoVZC45 just like the machine learning method and alignment-based methods.Additionally, some Pangolin viruses are close and over 6,500 viruses identified by their registration code in the database of [25] are farther away (the codes of slightly less than 60 of them are given in Table I).
For alignment-based methods on viruses of the database of [25] such a feat is complex or infeasible.On the other hand these methods yield biological interpretations.The alignment-free machine-learning method [25] determines the taxonomy of a single virus sequence and gives different phylogeny trees based on different aspects of it.The compression method gives a single phylogenetic tree for a set of many DNA/RNA sequences.The compression method is domainindependent and requires no parameters to be set, apart from the used compression algorithm.Obtaining essentially the same main positive results as the earlier studies, and agreeing with the generally believed hypothesis, this method is less complicated than the previous methods.It yields quantitative evidence that can be compared with the NCD's among the mtDNA's of familiar mammals.Since the method is uncomplicated and very fast it is useful as an exploratory investigation into the phylogeny and taxonomy of viruses of new epidemic outbreaks.
appears at the top with an NCD of 0.00362117 and has the official name of gisaid hcov-19 2020 07 17 22.fasta¡hCoV-19/USA/WI-WSLH-200082/2020-EPI ISL 471246-2020-04-08.The NCD of the selected SARS-CoV-2 virus against itself should have been 0 but because of the unavoidable compression/computation inaccuracy it is slightly more.

Fig. 1 .
Fig. 1.These 60 viruses have the least NCD with the SARS-CoV-2 virus.The figure therefore gives a visual representation of the structural relations close to the SARS-CoV-2 virus.We use human mtDNA as outgroup; it is about the same size as the viruses and it can be assumed to be completely different.Using an unrelated sequence which is called the "outgroup" is a common method to determine where the root of a phylogenetic tree is: where it joins the tree there is the root.Since S(T ) = 0.998948 the tree almost perfectly represents the NCD distance matrix concerned in the Supplemental Material.

TABLE I
FIRST 60 ITEMS IN THE SORTED LIST OF USED 6,751 VIRUS SEQUENCES.