A Systematic Survey on Deep Generative Models for Graph Generation

Graphs are important data representations for describing objects and their relationships, which appear in a wide diversity of real-world scenarios. As a critical problem in this area, graph generation considers learning the distributions of given graphs and generating more novel graphs. Generative models for graphs have a rich history; owing to their wide range of applications, however, traditional models are hand-crafted and only capable of modeling a few statistical properties of graphs. Recent advances in deep generative models for graph generation are an important step towards improving the fidelity of generated graphs and pave the way for new kinds of applications. This article provides an extensive overview of the literature in the field of deep generative models for graph generation. Firstly, the formal definition of deep generative models for graph generation and the preliminary knowledge are provided. Secondly, taxonomies of deep generative models for both unconditional and conditional graph generation are proposed, and the existing works in each category are compared and analyzed. After that, an overview of the evaluation metrics in this specific domain is provided. Finally, the applications that deep graph generation enables are summarized and five promising future research directions are highlighted.

engineered towards modeling a pre-selected family of graphs, such as random graphs [15], small-world networks [16], and scale-free graphs [12]. However, due to their simplicity and hand-crafted nature, these random graph models generally have limited capacity to model complex dependencies and can capture only a few statistical properties of graphs. Such methods usually fit well the properties that their predefined principles are tailored for, but usually cannot fit the others. For example, a contact network model can fit flu epidemics but not dynamic functional connectivity. In many domains, however, the network properties and generation principles are largely unknown, such as those explaining the mechanisms of mental diseases in brain networks, cyber-attacks, and malware propagation. As another example, Erdős–Rényi graphs do not have the heavy-tailed degree distribution that is typical of many real-world networks. In addition, the reliance on a priori assumptions prevents these traditional techniques from being applied to a larger range of domains, where a priori knowledge of the graphs is often unavailable.
Considering the limitations of traditional graph generation techniques, a key open challenge is developing methods that can directly learn generative models from an observed set of graphs, which is an important step towards improving the fidelity of generated graphs. It paves the way for new kinds of applications, such as novel drug discovery [17], [18] and protein structure modeling [19], [20], [21]. Recent advances in deep generative models, such as variational autoencoders (VAEs) [22] and generative adversarial networks (GANs) [23], have spurred a number of deep learning models for generating graphs, formalizing the promising area of deep generative models for graph generation, which is the focus of this survey.

Formal Problem Definition
A graph is defined as G = (V, E, F, E), where V is the set of N nodes and E ⊆ V × V is the set of M edges; e_{i,j} ∈ E is an edge connecting nodes v_i, v_j ∈ V. The graph can be conveniently described in matrix or tensor form using its (weighted) adjacency matrix A. If the graph is node-attributed or edge-attributed, there is a node attribute matrix F ∈ R^{N×D} assigning attributes to each node, or an edge attribute tensor E ∈ R^{N×N×K} assigning attributes to each edge e_{i,j}, where D is the dimension of the node attributes and K is the dimension of the edge attributes.
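As a concrete (toy) illustration of this tensor representation, the following sketch instantiates A, F, and E for a small attributed graph; all values and dimensions here are illustrative, not from any dataset.

```python
# A minimal sketch of the representation G = (V, E, F, E) from the text,
# for a toy graph with N = 3 nodes, D = 2 node-attribute dimensions and
# K = 2 edge-attribute dimensions (all values here are illustrative).
N, D, K = 3, 2, 2

# Adjacency matrix A (N x N): a triangle graph with edges 0-1, 1-2, 0-2.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]

# Node attribute matrix F (N x D): one D-dimensional vector per node.
F = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, 0.5]]

# Edge attribute tensor E (N x N x K): one K-dimensional vector per node pair.
E = [[[0.0] * K for _ in range(N)] for _ in range(N)]
E[0][1] = E[1][0] = [1.0, 0.0]  # attributes of edge e_{0,1}

# Edge count M for an undirected graph: one per unordered connected pair.
M = sum(A[i][j] for i in range(N) for j in range(i + 1, N))
```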
Given a set of observed graphs G = {G_1, ..., G_s} sampled from the data distribution p(G), where each graph G_i may have different numbers of nodes and edges, the goal of learning generative models for graphs is to learn the distribution of the observed set of graphs. By sampling a graph G ∼ p_model(G), new graphs can then be generated, which is known as deep graph generation, the short form of deep generative models for graph generation. Sometimes, the generation process can be conditioned on additional information y, such that G ∼ p_model(G|y), in order to provide extra control over the graph generation results. The generation process with such conditions is called conditional deep graph generation.

Challenges
The development of deep generative models for graphs poses unique challenges, which are mainly listed below.
Non-unique Representations. In the general setting, a graph with n nodes can be represented by up to n! equivalent adjacency matrices, each corresponding to a different, arbitrary node ordering. Given that a graph can have multiple representations, it is difficult for models to compute the distance between the generated graphs and the ground-truth graphs while training. This may require us to design either a predefined node ordering or a node-permutation-invariant reconstruction objective function.
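The non-uniqueness can be made concrete with a small sketch (ours) that enumerates the adjacency matrices produced by all node orderings of a 3-node path graph:

```python
from itertools import permutations

# Sketch: a graph with n nodes has up to n! equivalent adjacency matrices,
# one per node ordering. We enumerate them for a 3-node path graph 0-1-2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
n = len(A)

def permute(A, pi):
    """Return the adjacency matrix under the node relabelling i -> pi[i]."""
    B = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            B[pi[i]][pi[j]] = A[i][j]
    return B

# Collect the distinct matrices produced by all 3! = 6 orderings; for a
# path, the matrix is determined by which label the center node receives,
# so only 3 of the 6 are distinct.
variants = {tuple(map(tuple, permute(A, pi))) for pi in permutations(range(n))}
```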
Complex Dependencies. The nodes and edges of a graph have complex dependencies and relationships. For example, two nodes are more likely to be connected if they share common neighbors. Therefore, the generation of each node or edge cannot be modeled as an independent event. One way to formalize graph generation is to make autoregressive decisions, which naturally accommodate complex dependencies inside the graphs through a sequential formalization of graphs. Towards this challenge, in this survey, existing works are described and compared with regard to what kinds of dependencies (e.g., dependencies among nodes, among edges, or between nodes and edges) they can capture.
Large Output Spaces. To generate a graph with n nodes, a generative model may have to output n² values to specify its structure, which is expensive, especially for large-scale graphs. Moreover, it is common to find real-world graphs containing millions of nodes, such as citation and social networks. Consequently, it is important for generative models to scale to large graphs for realistic graph generation and to accommodate such complexity in the output space. The scalability of the existing works is a critical issue in comparing and evaluating the different categories of graph generative models in this survey, as discussed in Section 2.1.5 and Section 2.2.3.
Discrete Objects by Nature. Standard machine learning techniques, which were developed primarily for continuous data, do not work off-the-shelf and usually need adjustments. A prominent example is the back-propagation algorithm, which is not directly applicable to graphs, since it works only for continuously differentiable objective functions. To this end, it is usual to project graphs (or their constituents) into a continuous space and represent them as vectors or matrices. However, reconstructing the generated graphs from the continuous representations is a challenge. Reconstructing the discrete graph objects (i.e., nodes and edges) from continuous spaces results in different graph decoding processes, such as sequentially generating the nodes of the graph or generating the adjacency matrix of the graph in one shot. This challenge motivates the major criteria in the taxonomy of the existing methods in this survey.
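A minimal sketch of the projection-and-reconstruction issue, assuming a one-shot decoder that emits continuous pairwise logits; the decoder and its thresholding rule here are our illustrative stand-ins, not a specific published model:

```python
import math
import random

random.seed(0)

# Sketch (our illustration): a decoder emits continuous logits per node
# pair; discrete edges are recovered by passing them through a sigmoid and
# thresholding at 0.5. This rounding step is non-differentiable, which is
# exactly why training requires relaxations or likelihood-based objectives.
n = 4
logits = [[random.gauss(0.0, 2.0) for _ in range(n)] for _ in range(n)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Symmetrize and threshold to obtain a binary adjacency matrix with no
# self-loops (an undirected simple graph).
A = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        p = sigmoid((logits[i][j] + logits[j][i]) / 2.0)
        A[i][j] = A[j][i] = int(p > 0.5)
```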
Evaluation for Implicit Properties. Evaluating the generated graphs is a critical but challenging issue, due to the unique properties of graphs, which have complex, high-dimensional structures and implicit features. Existing methods use different evaluation metrics. For example, some works [18], [24], [25] compute the distance between the graph-statistic distributions of the graphs in the test set and the generated graphs, while other works [21], [26] indirectly use classifier-based metrics to judge whether the generated graphs follow the same distribution as the training graphs. It is important to systematically review all the existing metrics and choose the appropriate ones based on their strengths and limitations and the application requirements. Towards this challenge, we summarize a unified evaluation framework for graph generation in Section 4, including popular evaluation metrics for both unconditional and conditional graph generation.
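As a simplified instance of the statistic-based metrics mentioned above, the sketch below compares the degree distributions of two graph sets with total-variation distance; published works typically use MMD over several statistics, so this is only an illustrative stand-in:

```python
from collections import Counter

# Sketch in the spirit of the statistic-based metrics of [18], [24], [25]:
# compare the degree distributions of a reference set and a generated set
# of graphs. Total-variation distance is our simple stand-in for the MMD
# used in the literature.
def degree_hist(graphs, max_deg):
    """Empirical degree distribution over all nodes of all graphs."""
    counts, total = Counter(), 0
    for A in graphs:
        for row in A:
            counts[sum(row)] += 1
            total += 1
    return [counts[d] / total for d in range(max_deg + 1)]

def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]  # all degrees 2
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]      # degrees 1, 2, 1

dist_same = total_variation(degree_hist([triangle], 2), degree_hist([triangle], 2))
dist_diff = total_variation(degree_hist([triangle], 2), degree_hist([path], 2))
```

Identical sets yield distance 0, while structurally different sets yield a positive distance, matching the intuition behind distribution-level evaluation.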
Various Validity Requirements. Modeling and understanding graph generation via deep learning involves a wide variety of important applications, including molecule design [17], [27], protein structure modeling [20], AMR parsing [10], [28], etc. These interdisciplinary problems have their own unique requirements for the validity of the generated graphs. For example, generated molecule graphs need to have valency validity, while semantic parsing in NLP requires Part-of-Speech (POS)-related constraints. Thus, addressing the validity requirements of different applications is crucial in enabling wider applications of deep graph generation. In this paper, we elaborate on how the existing works improve the validity of the generated graphs when introducing the rule-based generation models in Section 2.1.4. In addition, the real-world applications of solving validity issues are elaborated in Section 5.
Black-box with Low Reliability. Compared with the traditional graph generation area, deep learning based graph modeling methods are like black boxes, which bear the weaknesses of low interpretability and reliability. Improving the interpretability of deep graph generative models is a vital issue in unpacking the black box of the generation process and paving the way for highly sensitive application domains that require strong reliability, such as smart health and autonomous driving. In addition, semantic explanation of the latent representations can further enhance the scientific exploration of the associated application domains. Interpretability and reliability are important aspects when comparing and evaluating the different graph generation methods in this survey, as discussed in Section 3.1.3, which compares the different conditional graph generation categories.

Our Contributions
Various advanced works on deep graph generation have been conducted, ranging from one-shot graph generation to sequential graph generation processes and accommodating various deep generative learning strategies. These methods aim to solve one or several of the above challenges and come from different fields, including machine learning, bioinformatics, artificial intelligence, human health, and social-network mining. However, the methods developed in different research fields tend to use different vocabularies and solve problems from different angles. Also, standard and comprehensive evaluation procedures to validate the developed deep generative models for graphs are lacking.
To this end, this paper provides a systematic review of deep generative models for graph generation. The goal is to help interdisciplinary researchers choose appropriate techniques to solve problems in their application domains, and more importantly, to help graph generation researchers understand the basic principles as well as identify open research opportunities in the deep graph generation domain. As far as we know, this is the first comprehensive survey on deep generative models for graph generation. Below, we summarize the major contributions of this survey:
• We propose a taxonomy of deep generative models for graph generation categorized by problem settings and methodologies. The drawbacks, advantages, and relations among the different subcategories are discussed.
• We provide a detailed description, analysis, and comparison of deep generative models for graph generation as well as the base deep generative models.
• We summarize and categorize the existing evaluation procedures and metrics, the benchmark datasets, and the corresponding results of deep generative models for graph generation tasks.
• We introduce existing application domains of deep generative models for graphs as well as the potential benefits and opportunities they bring to these applications.
• We suggest several open problems and promising future research directions in the field of deep generative models for graph generation.

Relationship with Deep Generative Models
Deep generative models form the backbone of the base learning methods of all the existing deep generative models for graph generation. Specifically, deep generative models offer a very efficient way to analyze and understand unlabeled data. The idea behind generative models is to capture the inner probabilistic distribution that generates a class of data in order to generate similar data [29]. Emerging approaches such as generative adversarial networks (GANs) [23], variational auto-encoders (VAEs) [22], generative recurrent neural networks (generative RNNs) [30] (e.g., PixelRNNs, RNN language models), flow-based learning [31], and many of their variants have led to impressive results in myriads of applications. We provide a review of five popular and classic deep generative models in Appendix A.

Relationship with Existing Surveys
There are three types of existing surveys that are relevant to our work. The first type mainly centers around traditional graph generation based on classic graph theory and network science [32], which does not cover the most recent advancements in deep generative neural networks in AI. The second type concerns representation learning on graphs [33], [34], [35], which focuses on learning graph embeddings from existing graphs; a few of these works include a handful of deep generative models that can be used for representation learning tasks. The third type is specific to particular applications, such as molecule design by deep learning, rather than this generic technical domain. As yet, there have been very few systematic surveys on deep generative models for graph generation, with just two recent contemporaneous papers [36], [37]. Both of these categorize graph generation mainly in terms of the general backbone learning models utilized (i.e., autoregressive, auto-encoder-based, RL-based, adversarial, and flow-based). We have instead opted to review this research field from more comprehensive and graph-specific perspectives, including task formulation, graph generating techniques, evaluations, applications, and datasets.
This yields a number of advantages compared to the existing surveys, namely: (1) Two main problems are covered: this survey comprehensively summarizes the techniques used for both unconditional and conditional generation problems; (2) Categorization from graph-specific perspectives: this survey categorizes the existing graph generation models (e.g., sequential generation and one-shot generation) from graph-specific perspectives, instead of by the all-purpose generative models developed and applied to all kinds of data generation; (3) Reviews of evaluation methods: this survey provides a comprehensive overview of the existing evaluation procedures and metrics for graph generation tasks; (4) More applications: this survey provides a comprehensive summary of a diverse range of applications, including domains like biology, NLP, and program analysis; and (5) Performance comparisons: this survey compares the performance of existing state-of-the-art methods on both synthetic and real-world datasets, reaching several insightful conclusions.

Outline of the Survey
The remaining part of the survey is organized as follows. In Sections 2 and 3, we provide the taxonomy of deep graph generation, whose structure is illustrated in Fig. 1. Section 2 compares related works on the unconditional deep graph generation problem and summarizes the challenges faced in each category. In Section 3, we categorize conditional deep graph generation in terms of three sub-problem settings; the challenges behind each problem are summarized, and a detailed analysis of the different techniques is provided. We summarize and categorize the evaluation metrics in Section 4 and present the applications that deep graph generation enables in Section 5. Finally, we discuss five potential future research directions and conclude the survey in Sections 6 and 7. Due to the space limit, we summarize the benchmark datasets and performance evaluation of existing works in Appendix B.

UNCONDITIONAL DEEP GENERATIVE MODELS FOR GRAPH GENERATION
The goal of unconditional deep graph generation is to learn the distribution p_model(G) from a set of observed realistic graphs sampled from the real distribution p(G) using deep generative models. Based on the style of the generation process, we can categorize the methods into two main branches: (1) sequential generating, which generates the nodes and edges in a sequential way, one after another; and (2) one-shot generating, which builds a probabilistic graph model based on the matrix representation and generates all nodes and edges in one shot. These two ways of generating graphs have their own limitations and merits. Sequential generating makes each local decision efficiently by conditioning on the decisions made before it, but it has difficulty preserving long-term dependencies; thus, some global properties of the graph (e.g., the scale-free property) are hard to include. Moreover, existing works on sequential generating are limited to a predefined ordering of the sequence, leaving open the role of permutation. One-shot generating methods have the capacity to model the global properties of a graph by generating and refining the whole graph (i.e., nodes and edges) synchronously through several iterations, but most of them are hard to scale to large graphs, since the time complexity is usually over O(N²) because of the need to collectively model the global relationships among nodes.

Sequential generating
This type of method treats graph generation as a sequential decision-making process, wherein nodes and edges are generated one by one (or group by group), conditioned on the sub-graph already generated. By modeling graph generation as a sequential process, these approaches naturally accommodate complex local dependencies between generated edges and nodes. A graph G is represented by a sequence of components S = {s_1, ..., s_N}, where each s_i ∈ S can be regarded as a generation unit. The distribution p(G) can then be formalized as the joint (conditional) probability of all the components. During generation, the components are generated sequentially, each conditioned on the already-generated parts.
One core issue is how to break down the graph generation to facilitate the sequential generation of its components, namely determining the formalization unit s_i for sequentialization. The most straightforward approach is to formalize the graph as a sequence of nodes, the basic components of a graph, to support node-sequence-based generation. These methods essentially generate the graph by generating each node and its O(N) associated edges in turn, and hence usually incur a total complexity of O(N · N) = O(N²). Another approach is to consider a graph as a set of edges, based on which a number of edge-sequence-based generation methods have been proposed. These methods represent the graph as a sequence of edges and generate one edge, together with its two end nodes, per step, which leads to a total complexity of O(|E| · 2) = O(|E|). Edge-sequence-based methods are thus usually better suited to sparse graphs than node-sequence-based approaches. Although both of these types are successful at retaining pairwise node relationships, they often fall short when it comes to capturing higher-order relationships [65]. For example, gene regulatory networks, neuronal networks, and social networks all contain a large number of triangles, and molecular graphs contain functional groups. These observations indicate the need to generalize the units of sequential generation from nodes/edges to interesting sub-graph patterns, known as motifs. To this end, a number of motif-sequence-based methods have been proposed that represent a graph as a sequence of graph motifs, so that the block of nodes and edges in a graph motif is generated simultaneously in each step, which usually improves efficiency. Although the above three types are all versatile in end-to-end graph generation, they fall short in ensuring the generation of "valid" graphs, namely graphs that respect the correct grammar and constraints that are common in fields like programming languages and molecule modeling.
To solve this, several rule-sequence-based methods have been proposed for domain-specific applications, where a graph is constructed based on a predefined sequence of rules that incorporate appropriate domain expertise. A more detailed description of the methods in each category is provided below.
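The shared skeleton of sequential generation can be sketched as follows, with the learned conditional distribution mocked by a constant edge probability; everything model-specific is deliberately stubbed out:

```python
import random

random.seed(1)

# Skeleton of sequential generation: each step emits one unit s_i (here a
# node plus its adjacency vector to the earlier nodes), conditioned on the
# sub-graph built so far. The conditional p(edge | sub-graph) is mocked by
# a fixed coin flip; a real model computes it from learned hidden states.
def generate_graph(max_nodes, edge_prob=0.5):
    A = []  # list of adjacency vectors; row i has length i
    for i in range(max_nodes):
        # Unit s_i = (v_i, {e_{i,j}}_{j<i}): one node plus its backward edges.
        A.append([int(random.random() < edge_prob) for _ in range(i)])
    return A

A = generate_graph(5)
```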
[Fig. 1 (taxonomy of unconditional deep graph generation methods):
Sequential generating — Node-sequence-based: collective-associated-edge generation [17], [18], [38], [39], [40], [41], [42]; progressive-associated-edge generation [43], [44], [45], [46]. Edge-sequence-based: independency-based [47]; dependency-based [19], [48]. Motif-sequence-based: domain-agnostic-based [49]; domain-specific-based [27], [50], [51]. Rule-sequence-based [9], [52].
One-shot generating — Adjacency-matrix-based: MLP-based [20], [21], [53], [54], [55], [56]; message-passing-based [57], [58], [59], [60]; invertible-transform-based [61], [62]; transposed-convolution-based [25], [63]. Edge-list-based: random-walk-based [64], [65], [66], [67]; node-similarity-based [2], [68], [69], [70], [71], [72].]

2.1.1 Node-sequence-based
General Framework. Node-sequence-based methods essentially generate the graph by generating one node and its associated edges in each step, as shown in Fig. 2(a). The graph is modeled as a sequence based on a predefined ordering π on the nodes. Each unit s_i in the sequence of components S is represented as a tuple s_i = (v^π_i, {e_{i,j}}_{j<i}) (as shown at the bottom of Fig. 2(a)), indicating that at each high-level step the generator generates one node v^π_i and its entire associated edge set {e_{i,j}}_{j<i}. (Here we omit the node and edge attribute symbols for clarity, but it is important to bear in mind that the generated nodes and edges can all have attributes, i.e., types and labels.) Specifically, in node-sequence-based generation, generating a unit s_i involves two main steps. In the first step, a node is generated conditioned on the currently generated graph G_i, which can be interpreted as learning p(v^π_i | G_i). The second step is to generate the associated edge set {e_{i,j}}_{j<i} for node v^π_i. There are two options when it comes to generating the associated edges of each node: 1) collective associated-edge generation, where predictions are conducted on all node pairs between v^π_i and the other existing nodes in G_i in a single shot, to directly generate the associated edge set {e_{i,j}}_{j<i}; and 2) progressive associated-edge generation, which generates the associated edges of node v^π_i in sequence, with two actions per step: addEdge, which determines the size of {e_{i,j}}_{j<i}, and selectNode, which determines to which node v^π_i will be connected if addEdge decides to add an edge.
Collective associated-edge generation. To conduct the predictions on the node pairs between the newly generated node v^π_i and all the other existing nodes, most works [17], [18], [38], [39], [40], [41], [69] resort to predicting the adjacency vector A^π_{i,·}, which covers all the potential edges from the newly added node v^π_i to the other existing nodes. Thus, each unit can be further represented as s_i = (v^π_i, A^π_{i,·}), and the sequence as Seq(G, π) = {(v^π_1, A^π_{1,·}), ..., (v^π_N, A^π_{N,·})}. The aim is to learn the distribution

p(Seq(G, π)) = ∏_{i=1}^{N} p(v^π_i | v^π_{<i}, A^π_{<i,·}) · p(A^π_{i,·} | v^π_{≤i}, A^π_{<i,·}),   (1)

where v^π_{<i} refers to the nodes generated before v^π_i and A^π_{<i,·} refers to the adjacency vectors generated before A^π_{i,·}. Such a joint probability can be implemented by sequence-based architectures such as generative RNN models [17], [18], [26], [40] and auto-regressive flow-based learning models [69]. Here we introduce the RNN-based models as an example.
In the generative RNN-based models, the node distribution p(v^π_i | v^π_{<i}, A^π_{<i,·}) is typically assumed to be a multivariate Bernoulli distribution parameterized by φ_i ∈ R^T, where T refers to the number of node categories. The edge-existence distribution p(A^π_{i,·} | v^π_{≤i}, A^π_{<i,·}) can be assumed to be the joint probability of several dependent Bernoulli distributions:

p(A^π_{i,·} | v^π_{≤i}, A^π_{<i,·}) = ∏_{j=1}^{i-1} p(A^π_{i,j} | A^π_{i,<j}, A^π_{<i,·}, v^π_{≤i}),   (2)

where p(A^π_{i,·} | A^π_{<i,·}) is parameterized by θ_i ∈ R^{i−1} and the distribution p(A^π_{i,j} | A^π_{i,<j}, A^π_{<i,·}) is parameterized by each entry θ_{i,j} of θ_i. The architecture implementing Eq. (1) and (2) can be regarded as a hierarchical RNN, where the outer RNN generates the nodes and the inner RNN generates each node's associated edges. After either a node or an edge is generated, a graph-level hidden representation of the already-generated sub-graph is updated through a message passing neural network (MPNN) [73]. Specifically, at each Step i, a parameter φ_i is calculated through a multilayer perceptron (MLP)-based function from the current graph-level hidden representation. The parameter φ_i parameterizes the Bernoulli distribution of node existence, from which node v^π_i is sampled. After that, the adjacency vector A^π_{i,·} is generated by sequentially generating each of its entries.
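A toy sketch of this hierarchical (outer node loop, inner edge loop) recurrence, with hand-rolled scalar updates standing in for the trained RNN cells and MPNN updates described above:

```python
import math
import random

random.seed(2)

# Toy sketch of the hierarchical decomposition: an "outer" recurrence over
# nodes and an "inner" recurrence over the entries of each adjacency vector
# A_{i,.}. The scalar state updates below are hand-rolled mocks standing in
# for trained RNN cells and MPNN-updated hidden representations.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def generate(num_nodes):
    h_graph = 0.0            # graph-level hidden state (outer RNN)
    rows = []
    for i in range(num_nodes):
        h_edge = h_graph     # inner state initialized from the outer one
        row = []
        for j in range(i):   # entry A_{i,j} conditions on A_{i,<j}
            p = sigmoid(h_edge)              # Bernoulli parameter theta_{i,j}
            bit = int(random.random() < p)
            row.append(bit)
            h_edge = 0.5 * h_edge + bit - 0.5    # inner-RNN update (mock)
        rows.append(row)
        h_graph = 0.9 * h_graph + sum(row) - 0.5 * i  # outer update (mock)
    return rows

A_rows = generate(6)
```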
Progressive associated-edge generation. The above-introduced collective associated-edge generation has a time complexity of O(N²), which is time-consuming, especially for sparse graphs. A remedy is to progressively select the nodes to be connected with the current node v^π_i from the existing nodes v^π_{<i}, until the desired number of nodes is selected, which is small for sparse graphs. Specifically, for the current node v^π_i, we generate {e_{i,j}}_{j<i} by applying two functions: 1) an addEdge function to determine the size of the edge set {e_{i,j}}_{j<i} of node v^π_i, and 2) a selectNode function to select the nodes to be connected from the existing graph [43], [44], [45], [46]. The complexity of the progressive associated-edge generation method is O(MN), where M refers to the number of edges.
Specifically, after generating a node v^π_i in the first step, an addEdge function outputs a parameter f_addEdge(h^π_{v_i}) of a Bernoulli distribution indicating whether to add an edge to node v^π_i. Here, h^π_{v_i} refers to the node-level hidden state of v^π_i, which is calculated through a node embedding function, e.g., an MPNN [73], based on the already-generated parts of the graph. If an edge is determined to be added, the next step is selecting the neighboring node v^π_j from the existing nodes. To achieve this, we can compute a score m^π_{i,j} for each existing node v^π_j based on the selectNode function f_selectNode:

m^π_{i,j} = f_selectNode(h^π_{v_i}, h^π_{v_j}),   (3)

which is then passed through a softmax function [74] to be properly normalized into a distribution over nodes:

p(e_{i,j} | v^π_{<i}, {e_{i,j}}_{j<i}) = softmax(m^π_{i,j}).   (4)
The MLP-based function f_selectNode maps a pair of node-level hidden states h^π_{v_i} and h^π_{v_j} to a score m^π_{i,j} for connecting node v^π_j to the new node v^π_i. This can be extended to handle discrete edge attributes by making m^π_{i,j} a vector of scores with the same size as the number of the edge attribute's categories and taking the softmax over all categories of the edge attribute. Based on the aforementioned procedure, the two functions f_addEdge and f_selectNode are iteratively executed to generate the edges within the edge set {e_{i,j}}_{j<i} of node v^π_i, until the terminal signal from f_addEdge indicates that no more edges for node v^π_i are to be added.
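The addEdge/selectNode loop can be sketched as follows; the two decision functions are mocked with a constant probability and uniform scores, whereas real models compute them from the node-level hidden states h^π_{v_i}, h^π_{v_j}:

```python
import math
import random

random.seed(3)

# Sketch of progressive associated-edge generation: after node v_i is added,
# alternate addEdge (continue? yes/no) and selectNode (softmax over the
# existing candidate nodes). Both decisions are mocked here; real models
# score node-level hidden states with MLPs as in Eq. (3)-(4).
def softmax_sample(scores):
    """Sample an index from softmax-normalized scores."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    r, acc = random.random(), 0.0
    for idx, e in enumerate(exps):
        acc += e / z
        if r < acc:
            return idx
    return len(scores) - 1

def generate(num_nodes, p_add_edge=0.6):
    edges = set()
    for i in range(1, num_nodes):
        candidates = list(range(i))  # nodes generated before v_i
        while candidates and random.random() < p_add_edge:    # addEdge
            scores = [0.0] * len(candidates)  # uniform mock scores
            j = candidates.pop(softmax_sample(scores))        # selectNode
            edges.add((j, i))  # j < i by construction
    return edges

E = generate(6)
```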

Edge-sequence-based
General Framework. Edge-sequence-based methods [19], [47], [48] consider a graph to be a sequence of edges and generate an edge, along with its two end nodes, in each step, as shown in Fig. 2(b). They define an ordering of the edges in the graph as well as an ordering function α(·) for indexing the nodes. The graph G can then be modeled by a sequence of edges, with each unit in the sequence being a tuple s_i = (α(u), α(v), F_u, F_v, E^{(i)}_{u,v}), which consists of the indexes α(u) and α(v) of nodes u and v, the node attributes F_u and F_v, and the edge attribute E^{(i)}_{u,v} of the edge generated at Step i. In edge-sequence-based generation, there are two ways to generate a unit s_i: the first assumes that α(u) and α(v) are mutually independent, while the second assumes they are mutually dependent, with details as follows.
Independency-based. Goyal et al. [47] used the depth-first search (DFS) algorithm [75] as the ordering index function α(·) to construct a canonical node index for the graph. The conditional distribution for generating each edge in graph G can be formalized as

p(s_i | s_{<i}) = p(α(u) | s_{<i}) · p(α(v) | s_{<i}) · p(F_u | s_{<i}) · p(F_v | s_{<i}) · p(E^{(i)}_{u,v} | s_{<i}),

where s_{<i} refers to the already-generated edges and nodes; the five elements in one tuple are assumed to be independent of each other. A customized long short-term memory (LSTM) network is designed, which consists of a transition state function f_trans for transferring the hidden state of the last step into that of the current step, an embedding function f_emb for embedding the already-generated graph into latent representations, and five separate output functions for the above five distribution components. The inference proceeds as

h^{(i)}_G = f_trans(h^{(i-1)}_G, f_emb(s_{i-1})),

where s_{i−1} refers to the tuple generated at Step i−1 and is represented as the concatenation of all the component representations in the tuple, and h^{(i)}_G is a graph-level LSTM hidden state vector that encodes the state of the graph generated so far at Step i. Given the graph state h^{(i)}_G, the outputs of the five functions f_α(u), f_α(v), f_{F_u}, f_{F_v}, f_{E_{u,v}} model the categorical distributions of the five components of the newly formed edge tuple, parameterized by the five vectors θ_α(u), θ_α(v), θ_{F_u}, θ_{F_v}, θ_{E_{u,v}}, respectively. Finally, the components of the newly formed edge tuple are sampled from the five learnt categorical distributions.
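The data-preparation side of this formulation, turning a graph into an edge sequence under a DFS node ordering, can be sketched as follows; attributes are omitted and the tie-breaking rule (visit lower-indexed neighbors first) is our own choice:

```python
# Sketch of converting a graph to an edge sequence under a DFS ordering, in
# the spirit of the independency-based formulation [47]: nodes are indexed
# by DFS discovery time alpha(.), and each edge becomes a tuple of its two
# endpoint indexes. Attributes F_u, F_v, E_{u,v} are omitted in this toy
# version, and lower-indexed neighbors are visited first by convention.
def dfs_edge_sequence(adj, root=0):
    alpha, order = {}, []           # discovery index per node, visit order
    stack = [root]
    while stack:
        u = stack.pop()
        if u not in alpha:
            alpha[u] = len(alpha)
            order.append(u)
            stack.extend(sorted(adj[u], reverse=True))
    seq, seen = [], set()
    for u in order:                 # emit each edge once, by discovery order
        for v in sorted(adj[u]):
            key = (min(u, v), max(u, v))
            if key not in seen:
                seen.add(key)
                seq.append((min(alpha[u], alpha[v]), max(alpha[u], alpha[v])))
    return seq

# Triangle plus a pendant node: edges 0-1, 1-2, 0-2, 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
seq = dfs_edge_sequence(adj)
```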

Dependency-based.
To further characterize the dependency between α(u) and α(v), Bacciu et al. [19] assume node dependence within a tuple. This method deals with homogeneous graphs without considering node/edge categories, representing each tuple in the sequence as s_i = (α(u), α(v)) and formalizing the distribution as p(s_i | s_{<i}) = p(α(u) | s_{<i}) · p(α(v) | α(u), s_{<i}). The first node is then sampled in the same way as α(u) above, while the second node in the tuple is sampled from a categorical distribution parameterized as

θ_α(v) = f_α(v)(h^{(i)}_G, g_emb(α(u))),

where the function g_emb is used for embedding the index of the first generated node u in the pair.

Motif-sequence-based
General Framework. Motif-sequence-based methods [27], [49], [50], [51] represent a graph G as a sequence of graph motifs, Seq(G) = {C_1, ..., C_M}, where the block of nodes and edges constituting each graph motif C_i is generated in one step, as shown in Fig. 2(c). A new graph motif C_i is generated in each step, conditioned on the current graph G_i at Step i, and is then connected to G_i.
A key problem in motif-based methods is how to connect the newly generated graph motif C i to G i , given that there are many potential ways to link two sub-graphs. These linking strategies are highly dependent on the definition of the graph motifs. For Domain-agnostic graphs, given a predefined node ordering, the graph motifs are usually defined as a combination of consecutive nodes. This allows us to predict the associated edges of all the nodes in C i and connect it to G i based on these predictions. For Domain-specific graphs, the motifs are usually defined and connected based on specific domain knowledge, such as chemical motifs for a task involving molecular structure generation.
Domain-agnostic-based. This line of work is designed for generating general graphs without the need for domain expertise. It is similar to the collective-associated-edge-generation category under the line of node-sequence generation, such as GraphRNN [18], in that it generates the adjacency vectors for nodes, except that several nodes instead of one are generated per step. As described in Section 2.1.1, a graph G is represented as a sequence of node-based tuples, one generated per step. Based on this node sequence, Liao et al. [49] (GRAN) regard every tuple consisting of B consecutive nodes as a graph motif C_i and generate one such block per step. In this way, the generated nodes in the new graph motif follow the ordering of the nodes in the whole graph and contain all the connection information between the existing and newly generated nodes. To formalize the dependency among the existing and newly generated nodes, GRAN proposes an MPNN-based model to generate the adjacency vectors. Specifically, at the i-th generation step, a graph G_i is constructed which contains the already-generated graph with B · (i − 1) nodes and the edges among them, as well as the B nodes in the newly generated graph motif. The new B nodes are initially fully connected with each other and with the previous B · (i − 1) nodes. Then an MPNN-based graph neural network (GNN) [76] on this augmented graph is used to update the nodes' hidden states by encoding the graph structure. After several rounds of message passing, the node-level hidden states of both the existing and newly added nodes are used to infer the final distribution of the newly added edges, where the Bernoulli distribution p(A^π_{i,j} | C_<t) modeling the edge existence is parameterized through an MLP that takes the node-level hidden states as input.
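The augment-then-score idea behind one block-wise step can be sketched as follows. This is an illustration only: a simple mean-aggregation message pass and a dot-product scorer stand in for GRAN's actual learned networks, and all names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gran_block_step(A_prev, H_prev, B, W_msg, w_out, rounds=2):
    """One GRAN-style generation step (illustrative sketch): add B new
    nodes, connect them to every existing node and to each other, run
    simple message passing on the augmented graph, then score each
    candidate new edge with a Bernoulli probability.
    """
    n = A_prev.shape[0]
    m = n + B
    # Augmented adjacency: old edges kept, all candidate edges to/among
    # the B new nodes added.
    A_aug = np.zeros((m, m))
    A_aug[:n, :n] = A_prev
    A_aug[n:, :] = 1.0
    A_aug[:, n:] = 1.0
    np.fill_diagonal(A_aug, 0.0)
    # Node states: carry over old states, initialize new ones at zero.
    H = np.zeros((m, H_prev.shape[1]))
    H[:n] = H_prev
    # A few rounds of mean-aggregation message passing.
    for _ in range(rounds):
        deg = A_aug.sum(1, keepdims=True).clip(min=1.0)
        H = np.tanh(((A_aug @ H) / deg) @ W_msg)
    # Bernoulli probability for each candidate edge (i, j), j a new node.
    probs = np.zeros((m, B))
    for j in range(n, m):
        for i in range(j):
            probs[i, j - n] = sigmoid((H[i] * H[j]) @ w_out)
    sampled = (rng.random(probs.shape) < probs).astype(int)
    return probs, sampled

L = 4
A0 = np.array([[0, 1], [1, 0]], dtype=float)   # two already-generated nodes
H0 = rng.normal(size=(2, L))
W_msg = rng.normal(size=(L, L)) * 0.1
w_out = rng.normal(size=L)
probs, sampled = gran_block_step(A0, H0, B=2, W_msg=W_msg, w_out=w_out)
```

Each call grows the graph by B nodes; repeating the step reproduces the block-sequential generation loop described above.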
Domain-specific-based. The definition of graph motifs and their connections can involve domain knowledge, as in the case of molecule generation (i.e., graphs of atoms) [27], [50]. Jin et al. [27] propose the Junction-Tree-VAE, which first generates a tree-structured scaffold over chemical substructures, and then combines them into a molecule with an MPNN. Specifically, a Tree Decomposition of Molecules algorithm [77] tailored for molecules is applied to decompose the graph G into several graph motifs C_i, and each C_i is regarded as a node in the tree structure. Another way of defining the graph motifs is to leverage the breaking of retrosynthetically interesting chemical substructures (BRICS) algorithm [78]. To generate a graph G, a tree T is first generated and then converted into the final graph. The decoder for generating T consists of both a topology prediction function and a label prediction function. The topology prediction function models the probability of the current node having a child, and the label prediction function models a distribution over the labels of all types of C_i. When reproducing a molecular graph G that underlies the predicted junction tree T, since each motif contains several atoms, the neighboring motifs C_i and C_j can be attached to each other as sub-graphs in many potential ways. To solve this, a scoring function (e.g., measuring the validness of the potentially generated graph) over all the candidate graphs is proposed, and the candidate that maximizes the scoring function is taken as the final generated graph.

Rule-sequence-based
General Framework. Several proposed methods [9], [52] generate a sequence of production rules or commands to guide the graph construction process sequentially. This is usually the method of choice when the targeted graph has strong constraints or a grammar that must be satisfied in order to construct a valid graph. For example, a molecule cannot violate fundamental properties like charge conservation, which constrains the patterns available for the node types and edges of a molecular graph. To ensure the validity of the generated graphs, graph generation is transformed into generating parse trees, which describe a discrete molecular structure using a context-free grammar (CFG), while the parse tree itself can be further expressed as a sequence of rules based on a pre-defined order.
Kusner et al. [52] propose generating a parse tree that describes a discrete object (e.g., arithmetic expressions or molecules) using a grammar; the resulting graph generation method is named GrammarVAE. As an example of using the parse tree for molecule generation: to encode the parse tree, they decompose it into a sequence of production rules by performing a pre-order traversal over its branches from left to right, and then convert these rules into one-hot indicator vectors, where each dimension corresponds to a rule in the SMILES grammar. A deep convolutional neural network then maps this sequence into a continuous latent vector z. During decoding, the continuous vector z is passed through an RNN which produces a set of unnormalized log probability vectors (i.e., "logits"). Each dimension of the logit vectors corresponds to a production rule in the grammar. The model generates the parse trees directly in a top-down direction by repeatedly expanding the tree with production rules. The molecules are then generated by following the rules sequentially, as shown in Fig. 2(d). Although the CFG provides a mechanism for generating syntactically valid objects, it is still incapable of guaranteeing that the model generates semantically valid objects [52]. To deal with this limitation, Dai et al. [9] propose the syntax-directed variational autoencoder (SD-VAE), in which a semantic restriction component is added ahead of the syntax-tree generator. This yields a generator with both syntactic and semantic validity.
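The tree-to-sequence flattening step can be illustrated with a toy grammar; this grammar is an assumption for illustration only (GrammarVAE uses the SMILES grammar), but the pre-order traversal and one-hot encoding mirror the procedure described above.

```python
# Toy context-free grammar: each production rewrites a non-terminal
# into a sequence of symbols (not the SMILES grammar of GrammarVAE).
RULES = [
    ("S", ("S", "+", "T")),   # rule 0
    ("S", ("T",)),            # rule 1
    ("T", ("x",)),            # rule 2
    ("T", ("y",)),            # rule 3
]

def tree_to_rule_sequence(tree):
    """Pre-order traversal of a parse tree -> list of rule indices.

    A tree node is (rule_index, [child subtrees]); terminals have no
    children. This is the flattening step applied before one-hot
    encoding each rule.
    """
    seq = [tree[0]]
    for child in tree[1]:
        seq.extend(tree_to_rule_sequence(child))
    return seq

def one_hot(seq, n_rules=len(RULES)):
    return [[1 if r == i else 0 for i in range(n_rules)] for r in seq]

# Parse tree for "x + y":  S -> S + T,  S -> T,  T -> x,  T -> y
tree = (0, [(1, [(2, [])]), (3, [])])
seq = tree_to_rule_sequence(tree)       # pre-order rule indices
vectors = one_hot(seq)
```

Decoding reverses this: the RNN emits one rule index per step, and applying the rules top-down while masking out productions whose left-hand side does not match the frontier non-terminal guarantees syntactic validity.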

Comparison of different sub-categories
In this subsection, we compare the four categories of sequential generating methods from three aspects: (1) Scalability: time complexity determines the scalability of the graph generation methods. Node-sequence-based methods commonly have a time complexity of O(N^2), where N denotes the number of nodes, while edge-sequence-based methods usually have a complexity of O(|E|). Thus, for sparse graphs where N^2 ≫ |E|, edge-sequence-based methods are more scalable than node-sequence-based ones. The complexities of motif-sequence-based methods vary from O(N^2) (e.g., for the domain-agnostic type) to O(N · |C|) (e.g., for the domain-specific type), where |C| refers to the number of motifs. The complexity of rule-sequence-based methods is usually linear in the number of rules used to generate a graph; (2) Expressiveness: the expressiveness of a generation model relies on its power to model the complex dependencies among the objects in the graph. Node-sequence and edge-sequence generation can capture the most sophisticated dependencies, including node-node, edge-edge and node-edge dependence, while motif-sequence-based methods are able to model the dependence between graph motifs, which captures high-order relationships and global patterns. Rule-sequence-based methods can model the dependency between the operation rules to capture the semantic patterns in building realistic graphs, which are usually difficult to learn directly from the graph topology; (3) Application scenarios: the selection of a sequential generating technique for a specific application scenario depends on its sensitivity to validness and the accessibility of the generation rules. Node- and edge-sequence-based methods are suitable for generating realistic graphs without domain expertise (e.g., known rules, constraints or candidate motifs), such as social and traffic networks.
Motif-sequence-based methods can partially guarantee the validness of the generated graph by selecting graph-motifs from the predefined valid motif candidates. Rule-sequence-based methods are more powerful in generating valid realistic graphs by following the correct grammar and constraints. Thus, the latter two types of methods are preferred in validness-sensitive applications, such as molecule generation and program modeling.

One-shot generating
One-shot generating methods learn to map each whole graph into a unified latent representation which follows some probabilistic distribution in latent space. Each whole graph can then be generated by directly sampling from this probabilistic distribution in one step. The core issue of these methods is usually how to jointly generate the graph topology together with node/edge attributes. Considering that the graph can usually be represented in terms of an adjacency matrix and a node attribute matrix, the typical solution is to learn the distribution of these two and generate them in one shot, which is categorized as adjacency-matrix-based generation. Learning the distribution of adjacency matrices is potentially expressive yet comes with efficiency issues in both memory and time. To this end, edge-list-based methods learn the local patterns and hence are usually better at handling larger graphs with simpler global patterns.

Adjacency-matrix-based
General Framework. Adjacency-matrix-based methods build models to directly map the latent embedding z to the output graph in terms of an adjacency matrix, generally with the addition of node/edge attribute matrices/tensors. Hence, how best to achieve an expressive and efficient mapping is the core challenge, and there is usually a trade-off between the two. Existing techniques are built upon popular deep neural network architectures that are MLP-based, message-passing-based, invertible-transformation-based or transposed-convolution-based. MLP-based models are highly end-to-end, while message-passing-based and transposed-convolution-based approaches can explicitly model higher-order correlations in graphs. Invertible-transformation-based techniques model invertible mappings more rigorously but impose more limitations on expressiveness.

MLP-based methods.
Most of the one-shot graph generation techniques simply construct the graph decoder g(z) using an MLP [20], [21], [53], [54], [55], [56], where the model's parameters can be optimized under common frameworks such as VAE and GAN. MLP-based models ingest a latent graph representation z ∼ p(z) and simultaneously output an adjacency matrix A^π and node attributes F^π, as shown in Fig. 3(a). Specifically, the generator g(z) takes D-dimensional vectors z ∈ R^D sampled from a statistical distribution such as the standard normal distribution and outputs graphs. For each z, g(z) outputs two continuous and dense objects through two simple MLPs: Ã^π, which defines edge attributes, and F̃^π, which denotes node attributes. Both Ã^π and F̃^π have a probabilistic interpretation, since each node and edge attribute is represented with the probabilities of a categorical distribution over types. To generate the final graph, it is required to obtain the discrete-valued objects A^π and F^π from Ã^π and F̃^π, respectively. Existing works realize this step in one of two ways, detailed as follows.
In the first way, existing works [20], [53], [54] use a sigmoid activation function to compute Ã^π and F̃^π during training. At test time, the discrete-valued estimates A^π and F^π can be obtained by taking the edge- and node-wise argmax of Ã^π and F̃^π. Alternatively, existing works [21], [55], [56] leverage categorical reparameterization with the Gumbel-Softmax [79], [80], which samples from a categorical distribution during the forward pass (i.e., F^π_i ∼ Cat(F̃^π_i) and A^π_{i,j} ∼ Cat(Ã^π_{i,j})) and uses the original continuous-valued Ã^π and F̃^π in the backward pass. In this way, these methods can perform continuous-valued operations during training and apply categorical sampling to finally generate F and A.
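The Gumbel-Softmax forward pass can be sketched in a few lines. Since this sketch uses no autodiff framework, the straight-through behaviour is shown by returning both the hard sample (used in the forward pass) and the soft sample (through which gradients would flow); shapes and the edge-type count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def gumbel_softmax(logits, tau=1.0):
    """Sample a relaxed one-hot vector from a categorical distribution.

    Add Gumbel(0, 1) noise to the logits and apply a temperature-
    controlled softmax; at low tau the sample approaches one-hot.
    """
    g = -np.log(-np.log(rng.random(logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def straight_through(logits, tau=1.0):
    """Discretize in the forward pass, keep the soft sample for backward.

    In an autodiff framework the gradient would bypass the argmax and
    flow through `soft`; here both are returned for inspection.
    """
    soft = gumbel_softmax(logits, tau)
    hard = np.eye(logits.shape[-1])[soft.argmax(-1)]
    return hard, soft

# Edge-type logits for a tiny 3-node graph with 2 edge types + "no edge".
edge_logits = rng.normal(size=(3, 3, 3))
A_hard, A_soft = straight_through(edge_logits, tau=0.5)
```

In PyTorch the same behaviour is available as `torch.nn.functional.gumbel_softmax(logits, tau, hard=True)`.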
Message-passing-based methods. Message-passing-based methods generate graphs by iteratively refining the graph topology and node representations of an initialized graph through an MPNN. Specifically, based on the latent representation z sampled from a simple distribution (e.g., Gaussian), we usually first generate an initialized adjacency matrix A^0 and initialized node latent representations H^0 ∈ R^{N×L}, where L refers to the length of each node representation (here we omit the node ordering symbol π for clarity). Then A^0 and H^0 are updated through an MPNN layer into A^1 and H^1, the adjacency matrix and hidden states at the first intermediate layer; another MPNN layer is applied to generate those of the second layer, and so on. Each MPNN layer updates the adjacency matrix and node states with trainable parameters v_1, v_2, v_3, w_1 and w_2. We can stack multiple such layers to explicitly characterize the higher-order correlation among nodes and edges, as illustrated in Fig. 3(c). Finally, after T layers of updating, the outputs A^T_{i,j} and F^T_i are used to parameterize the categorical distributions of each edge and node, based on which each edge A_{i,j} and node F_i are generated through the categorical sampling introduced above. To learn the above generator, existing methods leverage various learning frameworks such as VAE and GAN [57], [58], [59], or adopt a plain framework based on score-based generation [60].
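A minimal sketch of this iterative refinement follows, assuming a simple mean-style message function and a dot-product edge scorer in place of the surveyed models' learned parameterizations (all weights here are random placeholders).

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine(A, H, W_h, w_a, T=3):
    """Iteratively refine adjacency and node states (illustrative sketch).

    Each layer: (1) aggregate neighbour states weighted by the current
    soft adjacency, (2) update node states, (3) re-score every edge from
    the updated endpoint states.
    """
    n = A.shape[0]
    for _ in range(T):
        H = np.tanh((A @ H) @ W_h)              # message passing + update
        # Re-parameterize each edge probability from node-state pairs.
        scores = np.array([[(H[i] * H[j]) @ w_a for j in range(n)]
                           for i in range(n)])
        A = sigmoid(scores)
        np.fill_diagonal(A, 0.0)                # no self-loops
    return A, H

L = 5
A0 = rng.random((4, 4)); A0 = (A0 + A0.T) / 2; np.fill_diagonal(A0, 0)
H0 = rng.normal(size=(4, L))
W_h = rng.normal(size=(L, L)) * 0.2
w_a = rng.normal(size=L)
A_T, H_T = refine(A0, H0, W_h, w_a)
```

After T rounds, A_T plays the role of the edge-probability output that parameterizes the final categorical sampling.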

Invertible-transform-based methods.
Flow-based generative methods can also perform one-shot generation, via an invertible function between the graph G and the latent prior z sampled from a simple distribution (e.g., Gaussian), as shown in Fig. 3(b). Concretely, based on the vanilla flow-based learning techniques introduced in Section A.4, a special forward transformation G → z and backward transformation z → G need to be designed.
Madhawa et al. [62] propose the first flow-based one-shot graph generation model, called GraphNVP. To get z = (z_F, z_A) from G = (A, F) in the forward transformation, they first convert the discrete variables A and F into continuous variables by adding real-valued noise, which is known as dequantization. Then two types of reversible affine coupling layers, adjacency coupling layers and node attribute coupling layers, are utilized to transform the adjacency matrix A and the node attribute matrix F into latent representations z_A and z_F, respectively. In the l-th reversible coupling layer, each entry of the latent representation (e.g., the i-th entry of z^l_F) is updated by an affine map of the form z ⊙ exp(s(·)) + t(·), where ⊙ denotes element-wise multiplication. The functions s_A(·) and t_A(·) stand for the scale and translation operations, which can be implemented based on an MPNN, while s_F(·) and t_F(·) can be implemented based on MLP networks. To get G = (F, A) from z = (z_F, z_A) in the backward transformation, the reversed operation is conducted based on the forward transformation in Eq. (15) and (16). A probabilistic feature matrix F̃ is then generated given the sampled z_F and the generated adjacency matrix A through a sequence of inverted node attribute coupling layers. Likewise, the node-wise argmax of F̃ is used to get the discrete feature matrix F.
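The exact invertibility of an affine coupling layer can be checked on a flat vector. This sketch uses placeholder linear networks for the scale s(·) and translation t(·) rather than GraphNVP's MPNN-conditioned versions; the masking pattern and all weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Minimal affine coupling layer: the masked half of the vector passes
# through unchanged and conditions the transform of the unmasked half.
D = 6
mask = np.array([1, 1, 1, 0, 0, 0], dtype=float)   # first half conditions
Ws = rng.normal(size=(D, D)) * 0.1
Wt = rng.normal(size=(D, D)) * 0.1

def s(x): return np.tanh(x @ Ws)      # scale network (placeholder MLP)
def t(x): return x @ Wt               # translation network (placeholder)

def forward(x):
    """x -> z: affine-transform the unmasked half given the masked half."""
    xm = x * mask
    return xm + (1 - mask) * (x * np.exp(s(xm)) + t(xm))

def inverse(z):
    """z -> x: exact inverse of `forward`, using only the masked half."""
    zm = z * mask
    return zm + (1 - mask) * ((z - t(zm)) * np.exp(-s(zm)))

x = rng.normal(size=D)
z = forward(x)
x_rec = inverse(z)
```

Because s and t only ever see the masked half, which the layer copies through unchanged, the inverse can recompute them exactly; stacking layers with alternating masks lets every dimension be transformed.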
Transposed-convolution-based methods. One typical type of graph decoder in the one-shot generation techniques is constructed based on transposed convolutional neural networks [25]. It generates the adjacency matrix of the graph by taking the node latent representation vectors as input. The transposed-convolution-based decoder consists of a node transposed convolution layer and several edge transposed convolution layers.
The node transposed convolution layer is used to decode the edge representations of the graph based on the node embeddings. For example, after a node transposed convolution layer, the edge representation E_{i,j} between node v_i and node v_j can be computed as E_{i,j} = Σ_{m=1}^{L} σ(H_{i,m} μ̃_j), where σ(H_{i,m} μ̃_j) is the transposed convolution contribution of node v_i to its potential edge E_{i,j}, made by the m-th entry of its node representation, μ̃_j represents the entry of the transposed convolution filter vector μ̃ ∈ R^{N×1} that is related to node v_j, and L refers to the length of the node representation.
Several edge transposed convolution layers are then recursively applied to decode the latent edge representations from the upper layer back to those of the lower layer. Thus, E^{l+1}_{i,j} between node v_i and node v_j in the (l + 1)-th layer is computed as E^{l+1}_{i,j} = σ(φ̃^l_j Σ_{k=1}^{N} E^l_{i,k}), where φ̃^l_j Σ_{k=1}^{N} E^l_{i,k} can be interpreted as the decoded contribution of node v_i to its related edge representations E^{l+1}_{i,j}, φ̃^l_j refers to the element of the transposed convolution filter vector that is related to node v_j, and σ refers to the activation function.

Edge-list-based
General Framework. This category typically requires a generative model that learns edge probabilities, where all the edges are generated independently. These methods are usually applied when learning from one large-scale graph to generate a new one using the existing nodes. The general pipeline is composed of two main steps. A score is calculated for each edge (i.e., pair of nodes) to estimate the edge probability, after which the edges can be sampled.
In terms of how the edge probabilities are generated, existing works are further categorized as either random-walk-based [64], [65], [66], [67] or node-similarity-based [2], [68], [70], [71], [72]. Node-similarity-based models calculate the edge probability based on the similarity of each pair of node representations learnt from graphs, while random-walk-based methods estimate each edge probability by calculating the edge frequency over a large set of random walks generated by sampling from distributions learnt from graphs.
Random-walk-based. This type of method generates the edge probabilities based on a score matrix, which is computed from the frequency with which each edge appears in a set of generated random walks. NetGAN [64] is proposed to mimic large-scale real-world networks. Specifically, in the first step, a GAN-based generative model is used to learn the distribution of random walks over the observed graph, and then to generate a set of random walks. In the second step, a score matrix S ∈ R^{N×N} is constructed, where each entry counts how many times an edge appears in the set of generated random walks. Finally, based on the score matrix, the edge probability matrix Ã is calculated as Ã_{i,j} = S_{i,j} / Σ_{u,v} S_{u,v}, which is then used to generate each individual edge A_{i,j} through efficient sampling processes.
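The score-matrix construction and normalization can be sketched directly; the toy walks below are hand-written, whereas NetGAN first learns the walk distribution with a GAN and samples walks from it.

```python
import numpy as np

rng = np.random.default_rng(5)

def edge_probs_from_walks(walks, n_nodes):
    """Build a NetGAN-style score matrix from a set of random walks.

    Each consecutive node pair in a walk counts as one edge occurrence;
    the score matrix is symmetrized and normalized into an edge
    probability matrix A_tilde.
    """
    S = np.zeros((n_nodes, n_nodes))
    for walk in walks:
        for u, v in zip(walk[:-1], walk[1:]):
            S[u, v] += 1
            S[v, u] += 1
    return S / S.sum()

walks = [[0, 1, 2], [1, 2, 3], [0, 1, 3]]
A_tilde = edge_probs_from_walks(walks, n_nodes=4)

# Sample 3 edges (with replacement, for simplicity) proportional to A_tilde.
flat = A_tilde.flatten()
picks = rng.choice(flat.size, size=3, p=flat)
edges = [(p // 4, p % 4) for p in picks]
```

Frequently traversed edges dominate the probability mass, so the sampled graph inherits the connectivity patterns captured by the walk generator.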
Following this, some works propose improving NetGAN, either by changing the way the first node of a random walk is chosen [67] or by learning spatial-temporal random walks for spatial-temporal graph generation [66]. Gamage et al. [65] generalize NetGAN by adding two motif-biased random-walk GANs. The edge probability is thus calculated based on the score matrices from three sets of random walks (i.e., S^(1), S^(2), and S^(3)) that are generated by the three GANs. To sample each edge, one view S^(k) is randomly selected from the three score matrices, and the edge probability Ã_{i,j} is calculated from S^(k) in the same normalized way.
Node-similarity-based. These methods generate the edge probability based on pairwise relationships between the given or sampled node embeddings (as in [68]). Specifically, the probability adjacency matrix Ã is generated given the node representations Z ∈ R^{N×L}, where Z_i ∈ R^L refers to the latent representation of node v_i. Ã is then used to generate each individual edge A_{i,j} through efficient sampling processes. Existing methods differ in how Ã is calculated. Several works [2], [68], [70] compute Ã_{i,j} based on the inner product of the two node embeddings Z_i and Z_j. This reflects the idea that nodes that are close in the embedding space should have a high probability of being connected. These works require a setting where the node set is pre-defined and the node attributes F are known in advance. Specifically, by first sampling the node latent representations Z_i from the standard normal distribution, Kipf et al. [2], [68] calculate the probability adjacency matrix as Ã = Sigmoid(ZZ^T). The adjacency matrix A is then sampled from Ã, which parameterizes the Bernoulli distribution of edge existence, similar to the work in [70].
Other works [71], [72] compute Ã_{i,j} by measuring the closeness of two node representations with the ℓ2 norm. Liu et al. [71] propose a decoder that calculates Ã_{i,j} = 1/(1 + exp(C(‖Z_i − Z_j‖²₂ − 1))), where C is a temperature hyperparameter. Salha et al. [72] propose a gravity-inspired decoding scheme of the form Ã_{i,j} = Sigmoid(m_j − λ log ‖Z_i − Z_j‖²₂), where m_j is the gravity scale of node v_j learned from the input graph by its featured encoder.
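Both the inner-product decoder and the distance-based decoder are a few lines of NumPy; the sketch below assumes random embeddings in place of learned ones, so only the decoding step itself is demonstrated.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inner_product_decoder(Z):
    """A_tilde = sigmoid(Z Z^T): nodes close in embedding space get a
    high connection probability (Kipf et al.-style decoder)."""
    return sigmoid(Z @ Z.T)

def distance_decoder(Z, C=2.0):
    """A_tilde_ij = 1 / (1 + exp(C (||Z_i - Z_j||^2 - 1))): the
    temperature-controlled distance decoder of Liu et al."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return 1.0 / (1.0 + np.exp(C * (d2 - 1.0)))

Z = rng.normal(size=(5, 3))      # stand-in for learned node embeddings
A1 = inner_product_decoder(Z)
A2 = distance_decoder(Z)
```

In either case the resulting matrix parameterizes independent Bernoulli variables, from which the discrete adjacency matrix is sampled edge by edge.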

Comparison of different sub-categories
In this subsection, we compare the two types of one-shot methods from two aspects: (1) Time complexity: both adjacency-matrix-based generation and node-similarity-based edge-list generation have a complexity of O(N^2), since they need to consider every pair of the N nodes in the graph. Random-walk-based edge-list generation is more scalable, as the edges are sampled based on the edge probability, which is determined by the edge frequency in a set of generated random walks; and (2) Application scenarios: since adjacency-matrix-based methods can capture global patterns with high expressiveness, and their quadratic cost remains acceptable at small scale, these methods are widely used for small graphs (i.e., graphs with fewer than 1,000 nodes) whose global patterns are important, such as molecules and proteins. Edge-list-based methods, on the other hand, are efficient when it comes to generating large graphs whose local patterns are important, such as social networks and citation networks.

CONDITIONAL DEEP GENERATIVE MODELS FOR GRAPH GENERATION
The goal of conditional deep graph generation is to learn a conditional distribution p_model(G|y) based on a set of observed realistic graphs G along with their corresponding auxiliary information, namely a condition y. The auxiliary information could be category labels, semantic context, graphs from other distribution spaces, etc. Compared with unconditional deep graph generation, in addition to the challenge of generating graphs, conditional generation needs to consider how to extract the features from the given condition and integrate them into the generation of graphs. Thus, to systematically introduce the existing conditional deep graph generative models, we mainly focus on describing how these methods deal with the conditions. Since the conditions can take any form of auxiliary information, they are categorized into three types: graphs, sequences, and semantic context, shown as the yellow parts of the taxonomy tree in Fig. 1.

Conditioning on graphs
The problem of deep graph generation conditioning on another graph is also called deep graph transformation (also known as deep graph translation) [25]. It aims at translating an input graph G_S in the source domain to the corresponding output graph G_T in the target domain. Considering the entities that are transformed during the translation process, there are two categories of works in the domain of deep graph generation conditioning on graphs: edge transformation and node-edge co-transformation.

Edge transformation
Overall Problem Formulation. The problem of edge transformation is to generate the graph topology and edge attributes of the target graph conditioning on the input graph. It requires the edge set E and edge attributes E to change while the node set and node attributes stay fixed during the translation process: T : G_S(V, E_S, F, E_S) → G_T(V, E_T, F, E_T). The edge transformation problem has a wide range of real-world applications, such as modeling chemical reactions [86], protein folding [20] and malware cyber-network synthesis [25]. Existing works adopt different frameworks to model the translation process.
Some works utilize the encoder-decoder framework, learning an abstract latent representation of the input graph through the encoder and then generating the target graph from this hidden information through the decoder [25], [63]. For example, Guo et al. [25] propose a GAN-based model for graph topology transformation. The proposed GT-GAN consists of a graph translator and a conditional graph discriminator. The graph translator includes two parts: a graph encoder and a graph decoder. A graph convolutional neural network [97] is extended to serve as the graph encoder, embedding the input graph into node-level representations, while a new graph deconvolution net is used as the decoder to generate the target graph.
Zhou et al. [82] propose modeling the underlying distribution of graph structures of the input graph at different levels of granularity, and then "transferring" such a hierarchical distribution from the graphs in the source domain to a unique graph in the target domain. The input graph is characterized as several coarse-grained graphs by aggregating strongly coupled nodes with a small algebraic distance into coarser nodes. Overall, the framework can be separated into three stages. In the first stage, the coarse-grained graphs at K levels of granularity are constructed from the input graph adjacency matrix A_S. The adjacency matrix A^(k) of the coarse-grained graph at the k-th level is defined as A^(k) = P^(k) A^(k−1) (P^(k))^T, where P^(k) is the coarse-graining operator for the k-th level, A^(0) = A_S, and N^(k) refers to the number of nodes of the coarse-grained graph at level k. In the next stage, the coarse-grained graph at each level k is reconstructed back into a fine-grained adjacency matrix Ã^(k) = R^(k) A^(k) (R^(k))^T, where R^(k) is the reconstruction operator for the k-th level, so that all the reconstructed fine graphs are at the same scale. Finally, these graphs are aggregated into a unique one by a linear function to obtain the final adjacency matrix A_T = Σ_k (w_k Ã^(k) + b_k), where w_k ∈ R and b_k ∈ R are weights and biases.
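The coarsen-reconstruct-aggregate pipeline can be sketched with fixed assignment matrices standing in for the learned operators P and R; everything below (one coarsening level, the cluster assignment, the aggregation weights) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(7)

# Random symmetric input adjacency matrix without self-loops.
N = 6
A_S = (rng.random((N, N)) < 0.4).astype(float)
A_S = np.triu(A_S, 1); A_S = A_S + A_S.T

# Coarsening operator P: assign each fine node to one of 3 coarse nodes
# (fixed here; learned from algebraic distances in the actual method).
assign = np.array([0, 0, 1, 1, 2, 2])
P = np.eye(3)[assign].T                        # shape (3, N)
A_coarse = P @ A_S @ P.T                       # coarse-grained adjacency

# Reconstruction operator R lifts the coarse graph back to N nodes,
# spreading each coarse node's weight evenly over its cluster.
R = P.T / P.sum(1)                             # shape (N, 3)
A_fine = R @ A_coarse @ R.T                    # same scale as A_S

# Aggregate the levels with a linear combination (weights assumed).
w0, w1, b = 0.6, 0.4, 0.0
A_T = w0 * A_S + w1 * A_fine + b
```

The coarse level contributes smoothed, cluster-level structure while the original level preserves fine detail; the learned weights trade the two off when producing the target adjacency matrix.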

Node-edge co-transformation
Overall Problem Formulation. The problem of node-edge co-transformation (NECT) is to generate the node and edge attributes of the target graph conditioning on those of the input graph. It requires that both the nodes and edges can vary during the transformation process between the source graph and the generated target graph: T : G_S(V_S, E_S, F_S, E_S) → G_T(V_T, E_T, F_T, E_T). In terms of the techniques for how the input graph is assimilated to generate the target graph, there are two categories: embedding-based and editing-based. Embedding-based NECT. Embedding-based NECT normally encodes the source graph into latent representations containing higher-level rich information about the input graph by an encoder, which is then decoded into the target graph by a decoder, as shown in Fig. 4(a) [24], [83], [84], [85], [88]. These methods are usually based on conditional VAEs [98] and conditional GANs [99].
Kaluze et al. [83] propose exploring the latent spaces of directed acyclic graphs (DAGs) and develop a neural network-based DAG-to-DAG translation model, where both the domain and the range of the target function are DAG spaces. The encoder M_encode is borrowed from the deep-gated DAG recursive neural network (DG-DAGRNN) [100], which generalizes stacked RNNs on sequences to DAG structures. Each layer of the DG-DAGRNN consists of gated recurrent units (GRUs), which are repeated for each node v_i. The encoder outputs an embedding h = M_encode(G_S), which serves as the input of the DAG decoder. The decoder follows the node-sequential generation style described in Section 2.1.1. Specifically, the number of nodes N of the target graph is first predicted by an MLP with input h, and the hidden state of the target graph is initialized with h. Then, at each step, a node v_i as well as its corresponding edge set {e_{i,j}}_{j<i} are generated based on the current hidden state, until an end node is added to the graph or the number of nodes exceeds a predefined threshold. Following this, a general graph-to-graph model [24] is proposed by first formalizing the graph into a DAG without loss of information and utilizing a recurrence-based model to translate this DAG. It embeds the topology of the input graph into the node representations by exerting a topology constraint, which results in a topology-flow encoder. Its decoder follows the same node-sequence-based generation as proposed by You et al. [18]. There are also embedding-based graph translation methods that represent the graph as a set of graph motifs, which usually target the task of molecule optimization [84], [85].
Editing-based NECT. Different from the encoder-decoder framework, editing-based NECT directly modifies the input graph iteratively to generate the target graphs [86], [87], [101], as shown in Fig. 4(b). There are two ways to realize the process of editing the source graph. One is utilizing an RL agent to sequentially modify the source graph based on a formulated Markov decision process [86], [87], as described in Section A.5. The modification at each step is selected from a defined action set, including "add node", "add edge", "remove bonds", etc. The other is to update the nodes and edges of the source graph synchronously in a one-shot manner through an MPNN over several iterations [101].
You et al. [86] propose the graph convolutional policy network (GCPN), a general graph convolutional network-based model for goal-directed graph generation through reinforcement learning. The model is trained to optimize a domain-specific property of the source molecule through policy gradient, and acts in an environment that incorporates domain-specific rules. They define a distinct, fixed-dimension and homogeneous action space amenable to reinforcement learning, where an action is analogous to link prediction. Specifically, they first define a set of scaffold sub-graphs {C_1, ..., C_s} based on the source graph. This set acts as a sub-graph vocabulary containing the sub-graphs to be added to the target graph during graph generation. Given the modified graph G_t at step t, they define the corresponding extended graph as G_t ∪ C_i. Based on this definition, an action can either correspond to connecting a new sub-graph C_i to a node in G_t or connecting existing nodes within G_t.
Guo et al. [101] propose another way of editing the source graph iteratively, through a generation process extended from the MPNN-based adjacency-matrix-based one-shot method in Section 2.2.1 and Fig. 4(c), which conducts the generation over both node and edge attributes. The transformation process is modeled in several stages, and each stage generates an intermediate graph. Specifically, at each stage t there are two paths, namely the node translation and edge translation paths. In the node translation path, an MLP-based influence function is used to calculate the influence I^(t)_i on each node v_i from its neighboring nodes, and another MLP-based updating function is used to update the node attributes F_i. The edge translation path is constructed in the same way, where each edge is updated by the influence from its adjacent edges. In addition, to capture and maintain the consistency between nodes and edges in the generated graph, a spectral-based regularization is enforced in the final optimization objective.

Comparison of different sub-categories
In this sub-section, we compare the two categories of methods dealing with node-edge co-transformation (NECT). Since the comparison between different generating techniques is provided in Section 2, here we focus on the relationship between the input and target graphs from three aspects: (1) Patterns captured from input graphs: embedding-based NECT can capture the influence of the global patterns (e.g., density or molecule energy) of the input graphs on the target graph through a graph-level latent representation, while editing-based NECT has the advantage in modeling the influence of the local patterns (e.g., "hub" nodes or ring structures) of the input graphs on the target graphs; (2) Interpretability: editing-based NECT provides a more interpretable way by explicitly showing the transformation from the input to the target graph in a step-by-step fashion, which makes it more suitable for applications that demand high confidence, while embedding-based NECT only loosely connects the input and target graphs through a latent embedding that cannot be semantically explained.
(3) Application scenarios: embedding-based NECT is capable of modeling transformations with major and sophisticated changes from the input to the target graphs, while editing-based NECT, for efficiency reasons, is more suitable for transformations with only small changes.

Conditioning on sequence
The problem of deep graph generation conditioning on a sequence can be formalized as a deep sequence-to-graph transformation problem: the aim is to generate the target graph G_T conditioned on an input sequence X. The sequence-to-graph problem commonly arises in domains such as NLP [89], [90] and time series mining [91], [92].
The existing methods handle the semantic parsing task [89], [90] by transforming the sequence-to-graph problem into a sequence-to-sequence problem and utilizing a classical RNN-based encoder-decoder model to learn this mapping. Chen et al. [89], [90] propose a neural semantic parsing approach named Sequence-to-Action, which models semantic parsing as an end-to-end semantic graph generation process. Given a sentence X = {x1, ..., xm}, the Sequence-to-Action model generates a sequence of actions Y = {y1, ..., ym} for constructing the correct semantic graph. A semantic graph consists of nodes (including variables, entities, and types) and edges (semantic relations), with some universal operations (e.g., argmax, argmin, count, sum, and not). To generate a semantic graph, they define six types of actions: Add Variable Node, Add Entity Node, Add Type Node, Add Edge, Operation Function, and Argument Action. In this way, the generated parse tree is represented as a sequence, and the sequence-to-graph problem is transformed into a sequence-to-sequence problem. The attention-based sequence-to-sequence RNN model [102] with an encoder and decoder is then utilized, where the encoder converts the input sequence X into a sequence of context-sensitive vectors {b1, ..., bm} using a bidirectional RNN, and a classical attention-based decoder generates the action sequence Y.
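The interpretation of a decoded action sequence as graph-construction steps can be sketched as below. The tuple encoding and the hypothetical action names (`add_entity`, `add_variable`, `add_edge`) are assumptions for illustration and cover only a subset of the six action types described above:

```python
def execute_actions(actions):
    """Interpret a Sequence-to-Action output as graph-construction steps.
    Each action is a tuple; node-adding actions carry a symbol, and
    edge-adding actions carry (source id, target id, relation)."""
    nodes, edges = [], []
    for act in actions:
        kind = act[0]
        if kind in ("add_variable", "add_entity", "add_type"):
            nodes.append((len(nodes), kind, act[1]))   # (id, node kind, symbol)
        elif kind == "add_edge":
            _, src, dst, relation = act
            edges.append((src, dst, relation))
    return nodes, edges
```

Executing the full action sequence yields the semantic graph, so learning to generate graphs reduces to learning to generate these action sequences.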
Other methods handle the problem of time series conditioned graph generation [91], [92]: given an input multivariate time series, the aim is to infer a target relation graph that models the underlying interrelationships among the time series, with each node corresponding to one time series. Yang et al. [91] explore GANs in the conditional setting and propose the novel time series conditioned graph generation-generative adversarial network (TSGG-GAN). Specifically, the generator in a TSGG-GAN adopts a variant of recurrent neural network called simple recurrent units (SRU) [103] to extract essential information from the time series, and uses an MLP to generate the directed weighted graph.

Conditioning on semantic context
The problem of deep graph generation conditioning on semantic context aims to generate the target graph G_T conditioned on an input semantic context, which can usually be represented as additional meta-features. The semantic context can refer to the category, label, modality, or any additional information that can intuitively be represented as a vector C. The main issue is deciding where to concatenate or embed the condition representation in the generation process. In summary, the conditioning information can be added to one or more of the following modules: (1) the node state initialization module, (2) the message passing process for MPNN-based decoding, and (3) the conditional distribution parameterization for sequential generation.
Yang et al. [93] propose a novel unified model of graph variational generative adversarial nets, where the conditioning semantic context is input into the node state initialization module. Specifically, in the generation process, they first model the embedding Z_i of each node with separate latent distributions. Then, a conditional graph VAE (CGVAE) can be constructed directly by concatenating the condition vector C to each node latent representation Z_i to obtain the updated representation Ẑ_i. The distribution of each edge A_{i,j} is then assumed to be a Bernoulli distribution, parameterized by Â_{i,j} = Sigmoid(f(Ẑ_i)^T f(Ẑ_j)), where f(·) consists of a few fully connected layers. Li et al. [44] propose a conditional deep graph generative model that adds the semantic context information to the initialized latent representations Z_i at the beginning of the decoding process.
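A minimal sketch of this conditional edge decoding follows; `f` is a stand-in for the paper's small stack of fully connected layers (in the test below it is simply the identity):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def edge_probabilities(Z, C, f):
    """Decode edge probabilities in the CGVAE style: concatenate the
    condition vector C to every node latent Z_i, map through f, and
    score each pair with a sigmoid of an inner product."""
    Z_hat = [f(z + C) for z in Z]          # hat(Z)_i = f([Z_i ; C])
    n = len(Z_hat)
    A_hat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(Z_hat[i], Z_hat[j]))
            A_hat[i][j] = sigmoid(dot)     # Bernoulli parameter for edge (i, j)
    return A_hat
```

Each entry of `A_hat` parameterizes an independent Bernoulli, from which the adjacency matrix can be sampled.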
Li et al. [95] add the context information C to the message-passing module of their MPNN-based decoding process. Specifically, they parameterize decoding as a Markov process and generate the graph by iteratively refining and updating an initialized graph. At each step t, an action is taken based on the current node hidden states. To calculate h_i^t ∈ R^L (L denotes the length of the representation) for node v_i in the intermediate graph G_t after each update of the graph, they utilize a message-passing network with node message propagation, into whose layers the context information C ∈ R^K is injected through learnable weight matrices W ∈ R^{L×L}, Θ ∈ R^{L×L}, and Φ ∈ R^{K×L}, where K denotes the length of the semantic context vector.
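Since the exact layer equation is not reproduced here, the following is only a plausible sketch of how C can enter a message-passing update through the weight matrices W, Θ, and Φ; the specific combination rule (sum of self-transform, neighbor messages, and projected context, followed by ReLU) is an assumption:

```python
def mpnn_layer_with_context(H, adj, W, Theta, Phi, C):
    """One message-passing update with the semantic context C injected,
    in the spirit of (but not necessarily identical to) [95]:
    h_i' = relu(W h_i + Theta * sum_{j in N(i)} h_j + Phi^T C).
    W, Theta are L x L; Phi is K x L; C has length K."""
    def matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]
    # Project the context vector into the node-representation space.
    context_term = [sum(Phi[k][l] * C[k] for k in range(len(C)))
                    for l in range(len(H[0]))]
    out = []
    for i, h_i in enumerate(H):
        msg = [0.0] * len(h_i)
        for j, h_j in enumerate(H):
            if adj[i][j]:
                msg = [a + b for a, b in zip(msg, matvec(Theta, h_j))]
        combined = [a + b + c for a, b, c in
                    zip(matvec(W, h_i), msg, context_term)]
        out.append([max(0.0, x) for x in combined])
    return out
```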

EVALUATION METRICS FOR DEEP GRAPH GENERATION
Evaluating the generated graphs, as well as the learnt distribution over graphs, is a challenging and critical task for deep generative models on the graph generation problem, for two major reasons: 1) unlike conventional prediction problems, where only deterministic predictions need to be evaluated, deep graph generation requires evaluating the learnt distributions; 2) graph-structured data is much more difficult to evaluate than simple data with matrix/vector structures or semantic data such as images and texts. We therefore summarize the typical evaluation metrics for deep generative models for graph generation, as shown in Figure 3. We first present the metrics that apply to both unconditional and conditional deep graph generation, and then introduce metrics specially designed for conditional deep graph generation.

General evaluation for deep graph generation
To evaluate the quality of the generated graphs, the existing literature covers three categories of evaluation metrics, namely statistics-based, classifier-based, and intrinsic-quality-based evaluations. The first two categories require a comparison between the generated graph set and the real graph set, while intrinsic-quality-based evaluation directly measures the properties of the generated graphs.

Statistics-based
In statistics-based evaluation, the quality of the generated graphs is assessed by computing the distance between the graph-statistic distributions of the real graphs and the generated graphs. We first introduce typical graph statistics that measure different properties of graphs, and thereafter introduce the metrics that measure the distance between two distributions with regard to these statistics. Seven typical graph statistics are used in the existing literature, summarized as follows: (1) Node degree distribution: the empirical node degree distribution of a graph, which encodes its local connectivity patterns. (2) Clustering coefficient distribution: the empirical clustering-coefficient distribution of a graph. Intuitively, the clustering coefficient of a node is the ratio of the actual number of triangles the node is part of to the number of triangles it could potentially be part of.
(3) Orbit count distribution: the distribution of the counts of node 4-orbits of a graph. Intuitively, an orbit count specifies how many of these 4-orbit substructures a node is part of. This measure is useful for understanding whether the model can match higher-order graph statistics, as opposed to the node degree and clustering coefficient, which represent local (or close to local) measures of proximity. The first three graph statistics are distributions over each graph and are represented as vectors, while the last four are scalar values per graph. To evaluate the distance between two sets of graphs in terms of the above distribution statistics, two major metrics are usually utilized in the existing literature, introduced as follows.
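The first two statistics can be computed from a binary adjacency matrix as sketched below (pure Python for illustration; production code would typically use a graph library):

```python
def degree_distribution(adj):
    """Empirical node degree histogram of an unweighted adjacency matrix."""
    degrees = [sum(row) for row in adj]
    hist = [0] * (max(degrees) + 1)
    for d in degrees:
        hist[d] += 1
    total = len(degrees)
    return [h / total for h in hist]

def clustering_coefficient(adj, i):
    """Local clustering coefficient: actual triangles through node i
    divided by the number of triangles its neighborhood could support."""
    nbrs = [j for j, a in enumerate(adj[i]) if a]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(adj[u][v] for u in nbrs for v in nbrs if u < v)
    return links / (k * (k - 1) / 2)
```

For a triangle graph, every node has degree 2 and clustering coefficient 1, as the test below checks.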
Average Kullback-Leibler Divergence. Considering that each graph set contains a set of distributions in terms of a graph property x, we first calculate the average distribution of the whole set: the vectors of counts of the property x over all graphs are concatenated, and the probability density of x is estimated from this concatenated vector (e.g., yielding the average node degree distribution). Finally, the Kullback-Leibler divergence (KL-D [104]) between the average distribution P_ave(x) of the generated graph set and that of the real graph set, Q_ave(x), is calculated as: KL(P_ave || Q_ave) = Σ_x P_ave(x) log(P_ave(x) / Q_ave(x)).
Maximum Mean Discrepancy (MMD) [105]. The squared MMD between the graph statistic distribution P of the generated graph set and that Q of the real graph set can be derived as: MMD²(P || Q) = E_{x,y∼P}[k(x, y)] + E_{x,y∼Q}[k(x, y)] − 2 E_{x∼P, y∼Q}[k(x, y)], where x, y refer to graph statistics sampled from the two distributions. The kernel k(·,·) is designed as: k(x, y) = exp(−W(x, y)² / (2σ²)), where σ refers to the standard deviation of P or Q. Since the sampled graph statistics are themselves distributions, W(x, y) is defined as the Wasserstein distance (WD): W(x, y) = inf_{γ∈Π(x,y)} E_{(u,v)∼γ}[‖u − v‖], where Π(x, y) is the set of all joint measures whose marginals are x and y, respectively.
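The MMD computation for histogram-valued statistics can be sketched as below, assuming all histograms share the same integer support with unit spacing (so the 1-D Wasserstein distance reduces to a cumulative-sum formula); the bandwidth σ is a free choice here:

```python
import math

def wasserstein_1d(p, q):
    """First Wasserstein distance between two histograms defined on the
    same integer support (e.g., two degree distributions)."""
    diff, dist = 0.0, 0.0
    for pi, qi in zip(p, q):
        diff += pi - qi
        dist += abs(diff)
    return dist

def gaussian_emd_kernel(p, q, sigma=1.0):
    # k(x, y) = exp(-W(x, y)^2 / (2 sigma^2)); the bandwidth choice is
    # an assumption -- in practice it is often set from the data's spread.
    return math.exp(-wasserstein_1d(p, q) ** 2 / (2 * sigma ** 2))

def mmd_squared(P, Q, kernel=gaussian_emd_kernel):
    """Biased squared-MMD estimate between two sets of graph statistics
    (each element of P and Q is one graph's statistic histogram)."""
    def avg_k(A, B):
        return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))
    return avg_k(P, P) + avg_k(Q, Q) - 2 * avg_k(P, Q)
```

Two identical statistic sets yield a squared MMD of zero, and the score grows as the sets diverge.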
Distance metrics for scalar-valued statistics. The calculation of distance between two sets of graphs in terms of a scalar-valued statistic is much easier than for the distribution statistics. There are two major ways: (1) calculating the difference between the averaged value of the scalar-valued statistic over the generated graph set and that over the real graph set; (2) calculating the distance between the distribution of the scalar-valued statistic over the generated graph set and that over the real graph set. Many distance metrics can be used, such as KL-D, the Jensen-Shannon distance (JS), and the Hellinger distance (HD).

Classifier-based
Classifier-based evaluation typically utilizes a graph classifier to evaluate whether the generated graphs follow the same distribution as the real graphs, without explicitly defining graph statistics. Typically, a classifier is trained on the set of real graphs and tested on the set of generated graphs. This evaluation can only be utilized when multiple graph generative models are trained to generate multiple types of graphs, respectively. Here we introduce two existing classifier-based evaluations [26] that are based on the graph isomorphism network (GIN) [106], as follows.
Accuracy-based. First, a GIN is pre-trained on the training set consisting of the multiple types of graphs previously used for training the generative models. Then, for each type of generated graph, the accuracy of the trained GIN in classifying that type of generated graphs serves as the final evaluation metric.

Fréchet Inception Distance (FID)-based. FID computes the distance in the embedding space between two multivariate Gaussian distributions fitted to a generated set and a test set; a lower FID value indicates better generation quality and diversity. For each type of graph, the generated and real graphs in the testing set are first input into the pretrained GIN to obtain graph embeddings. Then the mean µ_G and covariance matrix Σ_G of the embeddings of the generated graph set, and the mean µ_R and covariance matrix Σ_R of those of the real graphs, are estimated. Finally, the FID metric for this type of graphs is computed as: FID = ‖µ_G − µ_R‖² + Tr(Σ_G + Σ_R − 2(Σ_G Σ_R)^{1/2}), where Tr(·) refers to the trace of a matrix.
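The FID computation can be sketched as below, simplified to diagonal covariances so the matrix square root reduces to per-dimension scalars (an assumption made to keep the example dependency-free; the general case needs a full matrix square root):

```python
import math

def fid_diagonal(mu_g, var_g, mu_r, var_r):
    """FID between two Gaussians fitted to graph embeddings, assuming
    diagonal covariances, for which
    FID = ||mu_G - mu_R||^2 + Tr(S_G + S_R - 2 (S_G S_R)^{1/2})
    reduces to elementwise sums over per-dimension variances."""
    mean_term = sum((g - r) ** 2 for g, r in zip(mu_g, mu_r))
    cov_term = sum(vg + vr - 2 * math.sqrt(vg * vr)
                   for vg, vr in zip(var_g, var_r))
    return mean_term + cov_term
```

Identical embedding distributions give FID = 0; any mean or variance mismatch increases it.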

Intrinsic-quality-based
Besides the evaluation by measuring the similarity between the real and generated graphs, there are three additional metrics that directly evaluate the quality of the generated graphs: their validity, uniqueness and novelty.
Validity. Since the generated graphs are sometimes required to preserve certain properties, it is straightforward to evaluate them by judging whether they satisfy such requirements, for example: (1) Cycle graphs/tree graphs: cycles and trees have obvious structural properties, and validity is calculated as the percentage of generated graphs that are actually cycles or trees [44]. (2) Molecule graphs: validity for molecule generation is the percentage of chemically valid molecules according to domain-specific rules [17].
Uniqueness. Ideally, high-quality generated graphs should be diverse, i.e., similar but not identical to one another. Thus, uniqueness is utilized to capture the diversity of generated graphs [17], [44], [47], [48], [62]. To calculate uniqueness, the generated graphs that are sub-graph isomorphic to some other generated graphs are first removed; the percentage of graphs remaining after this operation is defined as the uniqueness. For example, if the model generates 100 graphs, all of which are identical, the uniqueness is 1/100 = 1%.
Novelty. Novelty measures the percentage of generated graphs that are not sub-graphs of the training graphs, and vice versa [47], [48], [62]. Note that identical graphs here are defined as graphs that are sub-graph isomorphic to each other. In other words, novelty checks whether the model has learned to generalize beyond the training graphs.
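Uniqueness and novelty can be sketched as below, with a crude sorted-edge-list canonical form standing in for the (sub-)graph isomorphism tests used in practice; this stand-in is only exact when node orderings are already aligned:

```python
def canonical_form(edges):
    """Crude canonical form: the sorted, normalized edge list. Real
    evaluations use (sub-)graph isomorphism tests instead."""
    return tuple(sorted(tuple(sorted(e)) for e in edges))

def uniqueness(generated):
    """Fraction of generated graphs remaining after removing duplicates."""
    canon = [canonical_form(g) for g in generated]
    return len(set(canon)) / len(canon)

def novelty(generated, training):
    """Fraction of generated graphs not appearing in the training set."""
    train_canon = {canonical_form(g) for g in training}
    novel = [g for g in generated if canonical_form(g) not in train_canon]
    return len(novel) / len(generated)
```

The 100-identical-graphs example from the text gives a uniqueness of 1%.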

Evaluation for conditional deep graph generation
In addition to the above general evaluation metrics for graph generation, for conditional deep generative models some additional evaluation metrics can be involved, including graph-property-based and mapping-relationship-based evaluations.

Graph-property-based
Considering that each generated graph can have its associated real graph as a label in the conditional graph generation task, we can directly compare each generated graph to its label graph by measuring their similarity or distance via graph properties or kernels, such as the following: (1) random-walk kernel similarity, using the random-walk-based graph kernel [107]; (2) the combination of Hamming and Ipsen-Mikhailov distances (HIM) [108]; (3) spectral entropies of the density matrices; (4) eigenvector centrality distance [109]; (5) closeness centrality distance [110]; (6) Weisfeiler-Lehman kernel similarity [111]; (7) the Neighborhood Sub-graph Pairwise Distance Kernel [47], which matches pairs of sub-graphs with different radii and distances.

Mapping-relationship-based
Mapping-relationship-based evaluation measures whether the learned relationship between the conditions and the generated graphs is consistent with the true relationship between the conditions and the real graphs.
Explicit mapping relationship. In the situation where the true relationship between the input conditions and the generated graphs is known in advance, the evaluation can be conducted as follows: (1) When the condition is a category label, we can examine whether the generated graph falls into the conditional category by utilizing a graph classifier [21], [24]. Specifically, the real graphs are used to train a classifier and the classifier is used to classify the generated graphs. Then the accuracy is calculated as the percentage of the predicted categories that are the same as the input condition. (2) When the condition is a graph, where the task is to change some properties of the input graph, we can quantitatively compare the property scores of the generated and input graphs to see if the change indeed meets the requirement. For example, one can compute the improvement of logP scores of the optimized molecule in molecule optimization task [86].

Implicit mapping relationship.
Regarding the deep graph translation problem introduced in Section 3.1, the underlying patterns of the mapping from the input graphs to the real target graphs are sometimes implicit and complex to define and measure. Thus, a classifier-based evaluation metric can be utilized [25]. Regarding the input and target graphs as two classes, it assumes that a classifier capable of distinguishing the generated target graphs from the input graphs will also succeed in distinguishing the real target graphs from the input graphs. Specifically, a graph classifier is first trained on the input and generated target graphs; this trained classifier is then tested on classifying the input graphs and the real target graphs, and its results are used as the evaluation metric.

APPLICATIONS
Deep generative models for graph generation constitute a very active research domain, with a continuously increasing number of applications being proposed, including important topics such as molecule optimization and generation, semantic parsing in NLP, code modeling, and pseudo-industrial SAT instance generation.

Molecule generation
Molecule generation is a challenging mathematical and computational problem in drug discovery and material science; its aim is to design novel molecules with a range of desired chemical properties. Any small perturbation in the chemical structure may result in a large variation in the desired molecular property. Besides, the space of valid molecules quickly becomes prohibitively huge and complex as the number of combinatorial permutations of atoms and bonds grows. Currently, most drugs are hand-crafted by human experts in chemistry and pharmacology. Recent advances in deep generative models for graph generation have opened a new research direction by treating a molecule as a graph, with atoms as nodes and bonds as edges, with the potential to learn these molecules' generative representations for novel molecule generation while ensuring chemical validity and efficiency [17], [27], [40], [43], [56], [112]. Representative Work. Junction Tree VAE (JT-VAE) [27] formalizes the molecular structure generation task as an unconditional graph generation problem, where each atom in a molecule is a node in the graph and the bonds between atoms are represented as edges. JT-VAE adopts a motif-sequence-based generation approach, one of the sequential generating techniques, to generate a molecular graph by sequentially expanding a partially generated molecule with a valid chemical substructure at each step. Figure 5 shows the backbone VAE-based generative model consisting of two encoders and decoders. Here, the molecular graph G is first decomposed into its junction tree T_G, where each colored node in the tree represents a substructure in the molecule. Then both the tree and the graph are encoded into their latent embeddings z_T and z_G. To decode the molecule, the first step is to reconstruct the junction tree from z_T, and then assemble nodes in the tree back into the original molecule.
Fig. 5. JT-VAE for molecule generation [27]: a molecular graph G is first decomposed into its junction tree T_G, where each colored node in the tree represents a substructure in the molecule. Both the tree and the graph are then encoded into their latent embeddings z_T and z_G. To decode, the junction tree is first reconstructed from z_T.

Protein structure modeling
Proteins are large molecules composed of one or more long chains of amino acids. Analyzing the structure and function of proteins is a key part of understanding biological properties at the molecular and cellular level. Current computational modeling methods for protein design are slow and often require human oversight and intervention, which are often biased and incomplete. Inspired by the recent momentum in deep graph generative models, some works [6], [7], [20], [113], [114] demonstrate the potential of deep graph generative modeling for the fast generation of new, viable protein structures.
Representative Work. Guo et al. [6] propose a contact VAE (CO-VAE) to generate functionally relevant three-dimensional protein structures. Here, the protein structure is formalized as a graph in which each amino acid is a node and an edge exists when the physical distance between two amino acids falls below a pre-defined threshold. A VAE-based graph generative model is utilized to model and generate the graph, following the adjacency-matrix-based one-shot generating technique, where the node attributes and adjacency matrix of the graph are generated in a single shot. As shown in Figure 6, a protein structure is first represented by a graph consisting of a node attribute matrix and an edge attribute tensor. These two components are then input into the encoder of the VAE to learn the distribution of the latent embedding of the graph. In the decoder, the node and edge attributes are generated based on the sampled latent embedding and can then be recovered into the protein via a 3D reconstruction technique. Fig. 6. CO-VAE for protein structure modeling [6]: a protein graph represents the mutual distance between each pair of amino acids. Node and edge attributes are input into the encoder to learn the distribution of the latent embedding.
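Constructing such a contact graph from 3-D coordinates can be sketched as follows; the 8-angstrom threshold is a common convention in contact-map construction, not necessarily the paper's exact value:

```python
import math

def contact_graph(coords, threshold=8.0):
    """Build a protein contact graph: amino acids are nodes, and an edge
    exists when the pairwise (e.g., C-alpha) distance falls below the
    threshold (8 angstroms here, an assumed but common choice)."""
    n = len(coords)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                adj[i][j] = adj[j][i] = 1
    return adj
```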

Semantic parsing
The semantic parsing problem concerns mapping natural language to its logical forms, namely abstract meaning representations (AMR). Traditional semantic parsers usually rely on compositional, manually designed grammars to create the AMR structure and on lexicons for semantic grounding, which is time-consuming and heuristic. Recent works develop neural semantic parsers with sequence-to-sequence models [115], [116], which, however, only consider the word-sequence information and ignore other rich syntactic information. Because AMRs are naturally structured objects (e.g., tree structures), semantic AMR parsing methods based on deep graph generative models are deemed promising [10], [28], [89], [90], [117]. These methods represent the semantics of a sentence as a semantic graph (i.e., a sub-graph of a knowledge base) and treat semantic parsing as a semantic graph matching/generation process.
Representative Work. Zhang et al. [10] formalize AMR parsing as a graph generation problem conditioned on a sequence, where the input is the sequence of tokens of a target sentence and the output is its AMR graph. In the AMR graph, a node denotes a word in the sentence and a predicted edge represents the semantic relationship between two words. This work is an edge-list-based one-shot generation method, where the edges are generated based on pairs of node representations. As shown in Figure 7, the whole process consists of two stages: node prediction and edge prediction. Node prediction utilizes an RNN-based generative model to generate the nodes, selected from the tokens in the sentence. In the second stage (i.e., edge prediction), a score matrix measuring the probability of edge existence is learnt from the representation vectors of each pair of nodes, after which the edges are generated by sampling from the score matrix. This end-to-end deep graph generation technique for semantic parsing has demonstrated a powerful ability to automatically capture semantic information. Fig. 7. A two-stage AMR parsing process for a sequence-to-graph problem [10]: node prediction generates nodes based on the input sequence of tokens, and edge prediction generates the edges by sampling from the score matrix calculated from the node representations.

Code modeling
Code modeling considers both hard syntactic and semantic constraints when generating natural programming code, which can make the development of source code easier, faster, and less error-prone. Early works in this area showed that approaches from natural language processing can be applied successfully to source code. However, though these methods succeed at generating programs that satisfy some formal specifications, they cannot generate realistic-looking and valid programs. Since program graphs have been shown to encode semantically meaningful representations of programs, deep graph generative models have shown promising capability in generating small but semantically valid programs [8], [9], [118], [119].
Representative Work. Brockschmidt et al. [8] formalize code modeling as a graph structure generation problem, where the source code is represented by an abstract syntax tree (AST), as shown in Figure 8. In this tree, each node refers to a construct occurring in the code and the edges denote semantic relationships. The generation process follows the rule-based sequential generating technique: the AST, which incorporates rich structural information, is generated by expanding one node at a time using production rules from the underlying programming language grammar. This reduces the code generation task to a sequence of sampling problems, in which an appropriate production rule must be sampled based on the partial AST generated so far.
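The rule-based sequential expansion can be sketched with a toy grammar; the grammar itself, the uniform rule sampling (in place of the learned model's conditional rule scores), and the depth cutoff are all illustrative assumptions:

```python
import random

GRAMMAR = {  # A toy expression grammar standing in for a real language's.
    "Expr": [["Expr", "+", "Expr"], ["Expr", "*", "Expr"], ["Term"]],
    "Term": [["x"], ["y"], ["1"]],
}

def expand(symbol, rng, depth=0, max_depth=4):
    """Expand one nonterminal at a time by sampling a production rule,
    mirroring rule-based sequential AST generation. A learned model would
    score the candidate rules given the partial AST; here rules are
    sampled uniformly and deep recursion is forced to terminate."""
    if symbol not in GRAMMAR:
        return symbol                      # terminal: emit as-is
    rules = GRAMMAR[symbol]
    if depth >= max_depth and symbol == "Expr":
        rules = [["Term"]]                 # force termination at the fringe
    rule = rng.choice(rules)
    return [expand(s, rng, depth + 1, max_depth) for s in rule]
```

Each call site where a nonterminal is expanded corresponds to one sampling step of the sequential generator.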

Pseudo-industrial SAT instance generation
The problem of pseudo-industrial Boolean Satisfiability (SAT) instance generation is about generating artificial SAT problems that display the same characteristics as their real-world counterparts. Generating large numbers of SAT instances is important for developing and evaluating practical SAT solvers, which historically relies on extensive empirical testing on many SAT instances. Prior works addressing this problem relied on hand-crafted algorithms, but have difficulty simultaneously capturing the wide range of characteristics exhibited by real-world SAT instances [121], [122]. Thus, it is promising to represent SAT formulas as graphs, recasting the original problem as a deep graph generation task [123], [124].
Fig. 8. Representing a program as an abstract syntax tree [8], [120]: each node refers to a construct occurring in the code and the edges denote semantic relationships.
Representative Work. G2SAT [123] formalizes the SAT generation task as a graph generation problem by representing a SAT formula as a bipartite graph, where each node represents either a literal or a clause and an edge denotes the occurrence of a literal in a clause (a disjunction). In general, the generation process follows the motif-sequence-based generating style, where a new motif is added to the partially generated graph in each step; the motifs are trees split from the existing training bipartite graphs. As shown in Figure 9, G2SAT generates a bipartite graph by starting with a set of motifs. In each step, a new motif is added by merging one of its clause nodes with an existing node in the partially generated graph. Finally, all the clauses are combined with conjunction operations to recover the SAT formula. Fig. 9. An overview of the G2SAT model [123]: In each step, two clause nodes are merged into a single clause node. A GCN-based classifier that captures the bipartite graph structure is used to sequentially decide which nodes to merge.

FUTURE OPPORTUNITIES
Scalability. The time complexity of existing deep graph generative models is typically O(N²) [18], [19], [26], [47], [69] or O(M) [112], where N is the number of nodes and M is the number of edges. Consequently, most existing works merely focus on small graphs, typically with dozens to thousands of nodes [2], [21], [44], [53], [56], [72], [86]. However, many real-world networks are large, with millions to billions of nodes [47], such as the Internet, biological neuronal networks, and social networks. It is important for any generative model to scale to large graphs.
Validity constraint. Many real-world networks are constrained by specific validity requirements [54]. For example, in molecular graphs, the number of bonding-electron pairs cannot exceed the valency of an atom; in protein interaction networks, two proteins may be connected only when they belong to the same or correlated gene ontology terms. Graph-topological constraints are challenging to enforce during the model training process. Intuitive approaches include designing heuristic and customized algorithms to ensure the validity of generated graphs; for example, Dai et al. [9] apply an attribute grammar as a constraint in parse-tree generation, a step toward semantic validity. Some recent works construct a more generic framework under a constrained-optimization scenario, minimizing the training loss under graph validity constraints [54]. However, as such constraints are typically discrete and non-differentiable, they need to be approximated with a smooth relaxation, which introduces errors and cannot preclude all invalid topologies.
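A heuristic valency check of the kind used for molecule validity can be sketched as follows; the valence table and the bond encoding are simplified assumptions (real pipelines use, e.g., RDKit sanitization):

```python
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "H": 1}  # simplified valence table

def is_valid_molecule(atoms, bonds):
    """Check the valency constraint: the bonding-electron pairs at each
    atom must not exceed its valence. `bonds` maps (i, j) -> bond order."""
    used = [0] * len(atoms)
    for (i, j), order in bonds.items():
        used[i] += order
        used[j] += order
    return all(u <= MAX_VALENCE[a] for a, u in zip(atoms, used))

def validity(molecules):
    """Fraction of generated molecules passing the check."""
    ok = sum(is_valid_molecule(a, b) for a, b in molecules)
    return ok / len(molecules)
```

Such checks are discrete and non-differentiable, which is exactly why enforcing them during training requires the smooth relaxations discussed above.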
Interpretability. When we learn the underlying distribution of complex structured data, i.e., graphs, learning interpretable representations of the data that expose semantic meaning is very important [125]. For example, it is highly beneficial if we can identify which latent variable(s) control which specific properties (e.g., molecule mass) of the generated graphs (e.g., molecules). It is also useful to disentangle the local generative dependencies among different sub-graphs. However, existing works on this topic focus only on graph embedding, not generation [126], [127]. For example, Stoehr et al. [128] demonstrate the potential of latent-variable disentanglement in graph deep learning for the unsupervised discovery of generative parameters of random and real-world graphs. Investigations into graph decoding and generation remain open problems, with only very recently published works [129], [130], [131].
Beyond training data. Deep generative models are data-driven models based on training data. The novelty of the generated graphs is highly desired yet usually restricted by the training data and by model properties (e.g., mode collapse in generative adversarial nets). To address such issues, attempts in the image domain modify an attribute of a generated image by adding a learned vector to its latent code [132] or by combining the latent codes of two images [133]. Additional works insert extra control into image generation [132] through additional labels corresponding to key factors such as object size and facial expression. However, comparable works on graph generation, which could require very different techniques than image generation, are lacking.
Dynamic graphs. Existing deep graph generative models typically focus on static graphs, but many real-world graphs are dynamic: their node attributes and topology can evolve over time, as in social networks, mobility networks, and protein folding. Representation learning for dynamic graphs is a hot domain, but it focuses only on graph embedding rather than generation; modeling and understanding the generation of dynamic graphs has not been explored. Therefore, additional problems such as jointly modeling temporal and graph patterns, as well as temporal validity constraints, need to be addressed.

CONCLUSION
In this survey paper, we provide a systematic review of deep generative models for graph generation. We present a taxonomy of deep graph generative models based on problem settings and technique details, followed by a detailed introduction, comparison, and discussion of them. We also systematically review the evaluation measures for deep graph generative models, including general evaluation metrics for both unconditional and conditional graph generation. Finally, we summarize popular applications in this domain.

APPENDIX A PRELIMINARY KNOWLEDGE OF DEEP GENERATIVE MODELS
In recent years, there has been a resurgence of interest in deep generative models, which have been at the forefront of deep unsupervised learning for the last decade, because they offer a very efficient way to analyze and understand unlabeled data. The idea behind generative models is to capture the inner probabilistic distribution that generates a class of data in order to generate similar data [29]. Emerging approaches such as generative adversarial networks (GANs) [23], variational auto-encoders (VAEs) [22], generative recurrent neural networks (generative RNNs) [30] (e.g., pixelRNNs, RNN language models), flow-based learning [31], and many of their variants and extensions have led to impressive results in a myriad of applications. In this section, we review five popular and classic deep generative models that learn distributions by observing large amounts of data in any format: VAEs, GANs, generative RNNs, flow-based learning, and reinforcement learning. These also form the backbone learning methods of all existing deep generative models for graph generation.

A.1 Variational Auto-encoders
VAE [22] is a latent variable-based model that pairs a top-down generator with a bottom-up inference network. Instead of directly performing maximum likelihood estimation on the intractable marginal log-likelihood, training is done by optimizing the tractable evidence lower bound (ELBO). Suppose we have a dataset of samples x from a distribution parameterized by ground-truth generative latent codes z ∈ R^c (c refers to the length of the latent codes). VAE aims to learn a joint distribution between the latent space z ∼ p(z) and the input space x ∼ p(x). Specifically, in the probabilistic setting of a VAE, the encoder is defined by a variational posterior q_φ(z|x), while the decoder is defined by a generative distribution p_θ(x|z), as represented by the two orange trapezoids in Fig. 10(a). φ, θ are the trainable parameters of the encoder and decoder. The VAE aims to learn a marginal likelihood of the data in a generative process as: max_{φ,θ} E_{q_φ(z|x)}[log p_θ(x|z)]. Then the marginal likelihood of an individual data point can be rewritten as: log p_θ(x) = D_KL(q_φ(z|x)||p_θ(z|x)) + L(φ, θ; x, z), where the first term is the non-negative Kullback-Leibler divergence between the true and the approximate posterior, and the second term is called the (variational) lower bound on the marginal likelihood: L(φ, θ; x, z) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x)||p(z)). Thus, maximizing L(φ, θ; x, z) maximizes a lower bound of the true objective. To make the optimization of this objective tractable in practice, we assume a simple prior distribution p(z), a standard Gaussian N(0, I) with a diagonal covariance matrix. Parameterizing the distributions in this way allows the use of the "reparameterization trick" to estimate gradients of the lower bound with respect to the parameters φ: each random variable z_i ∼ q_φ(z_i|x) is parameterized as a Gaussian with a differentiable transformation of a noise variable ε ∼ N(0, 1); that is, z is computed as z = µ + σ ⊙ ε, where µ and σ are outputs of the encoder.
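As a concrete illustration of the reparameterization trick and the KL term of the ELBO, the minimal NumPy sketch below (all parameter values are hypothetical) draws z = µ + σ ⊙ ε and evaluates the closed-form KL divergence between a diagonal Gaussian posterior and the standard normal prior:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so gradients
    can flow through mu and log_var (the reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q_phi(z|x) || N(0, I)) for a diagonal Gaussian:
    0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1.0)

# Hypothetical encoder outputs for one data point with a 4-dim latent code.
mu = np.array([0.5, -0.2, 0.0, 1.0])
log_var = np.array([0.0, -1.0, 0.5, 0.0])

z = reparameterize(mu, log_var)          # one latent sample
kl = kl_to_standard_normal(mu, log_var)  # KL term of the negative ELBO
```

In a full VAE, `kl` would be combined with the reconstruction term E_{q_φ(z|x)}[log p_θ(x|z)] to form the ELBO; here only the analytically tractable KL piece is shown.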

A.2 Generative Adversarial Nets
GANs were introduced as an alternative way to train a generative model [23]. GANs are based on a game-theoretic scenario, the min-max game, in which a discriminator and a generator compete against each other. The generator generates data from stochastic noise, and the discriminator tries to tell whether it is real (coming from the training set) or fabricated (from the generator). Both networks learn simultaneously as they try to outperform each other. Specifically, the architecture of GANs consists of two 'adversarial' models: a generative model G_θ, which captures the data distribution p(x), and a discriminative model D_φ, which estimates the probability that a sample comes from the training set rather than from G_θ, as shown in Fig. 10(c). Both G_θ and D_φ can be non-linear mapping functions, such as multi-layer perceptrons [134] parameterized by θ and φ. To learn a generator distribution p_model(x) over observed data x, the generator builds a mapping function G_θ(z) from a prior noise distribution p_z(z) to the data space. The discriminator, D_φ(x), outputs a single scalar representing the probability that the input data x came from the training data rather than being sampled from p_model(x).
The generator and discriminator are trained simultaneously: the parameters of G_θ are adjusted to minimize log(1 − D_φ(G_θ(z))), and the parameters of D_φ are adjusted to maximize log D_φ(x), as in the two-player min-max game with value function V(G_θ, D_φ): min_{G_θ} max_{D_φ} V(G_θ, D_φ) = E_{x∼p(x)}[log D_φ(x)] + E_{z∼p_z(z)}[log(1 − D_φ(G_θ(z)))]. The training of the generator and discriminator keeps alternating until the generator can hopefully generate real-like data that a strong discriminator finds difficult to distinguish from real samples.
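The alternating min-max updates can be sketched end-to-end on a toy 1-D problem. In the sketch below (the data distribution, model forms, and all parameter values are illustrative, not from the survey), the discriminator is a logistic regressor, the generator is an affine map, and both take finite-difference gradient steps on the value function V to keep the example dependency-free:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Toy 1-D setting: real data x ~ N(2, 0.5), generator G(z) = a*z + c,
# discriminator D(x) = sigmoid(w*x + b).
x_real = 2.0 + 0.5 * rng.standard_normal(256)
z = rng.standard_normal(256)

def value_fn(theta_g, phi_d):
    """Min-max value V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    a, c = theta_g
    w, b = phi_d
    fake = a * z + c
    return (np.mean(np.log(sigmoid(w * x_real + b) + 1e-12))
            + np.mean(np.log(1.0 - sigmoid(w * fake + b) + 1e-12)))

def grad(f, params, eps=1e-5):
    """Central finite-difference gradient (avoids any autodiff dependency)."""
    g = np.zeros_like(params)
    for i in range(len(params)):
        d = np.zeros_like(params); d[i] = eps
        g[i] = (f(params + d) - f(params - d)) / (2 * eps)
    return g

theta_g = np.array([1.0, 0.0])   # generator parameters (a, c)
phi_d = np.array([0.1, 0.0])     # discriminator parameters (w, b)
lr = 0.05
for _ in range(100):
    # Discriminator ascends V; generator descends V (alternating updates).
    phi_d += lr * grad(lambda p: value_fn(theta_g, p), phi_d)
    theta_g -= lr * grad(lambda t: value_fn(t, phi_d), theta_g)
```

Since both expectations are logs of probabilities, V is always non-positive; at the game's equilibrium D outputs 1/2 everywhere and V = −2 log 2.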
In general, GANs show great power in generating data such as images [23], [135], audio [136], and text [137]. In contrast to VAEs, GANs learn to generate samples without assuming an approximate distribution. By utilizing the discriminator, GANs avoid optimizing an explicit likelihood loss function, which may explain their ability to produce high-quality objects, as demonstrated by [135]. However, GANs still have drawbacks. One is that they can be extremely hard to train in an adversarial fashion: they may easily fall into a divergence trap by getting stuck in a poor local minimum. Mode collapse is also an issue, where the generator produces samples belonging to a limited set of modes, resulting in low diversity. Moreover, alternating training and the large computational workload of two networks can lead to a long convergence process.

A.3 Generative Recursive Neural Network
RNN [138] is a straightforward adaptation of the standard feed-forward neural network that uses an internal state (memory) to process variable-length sequential data. At each step, the RNN predicts the output depending on the previously computed hidden states and updates its current hidden state; that is, RNNs have a "memory" that captures information about what has been computed so far. The RNN's high-dimensional hidden state and nonlinear evolution endow it with great expressive power to integrate information over many iterative steps for accurate predictions. Even if the non-linearity used by each unit is quite simple, iterating it over time leads to very rich dynamics [30].

In reinforcement learning, as realized for example by the deep Q-network (DQN), an agent learns to generate sequential objects by taking a series of actions [144]; a sequential object is generated based on the sequence of actions taken.
During generation, the DQN selects the action at each step using an ε-greedy policy: with probability ε, a random action is selected from the range of possible actions; otherwise, the action with the highest Q-value is selected. To perform experience replay, the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1}) at each time step t are stored in a data set D_t = {e_1, . . . , e_t}. At each iteration i of the learning process, the weight updates are applied to samples of experience (s_t, a_t, r_t, s_{t+1}) ∼ U(D), drawn uniformly at random from the pool of stored samples, with the following loss function: L_i(θ_i) = E_{(s,a,r,s')∼U(D)}[(r + γ max_{a'} Q(s', a'; θ_i^−) − Q(s, a; θ_i))^2], where γ is the discount factor, θ_i refers to the parameters of the Q-network at iteration i, and θ_i^− refers to the network parameters used to compute the target at iteration i. The target network parameters θ_i^− are only updated with the Q-network parameters θ_i every several steps and are held fixed between individual updates. The process of generating data after training is similar to the training process.
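The ε-greedy selection, uniform replay sampling, and squared TD-error loss described above can be sketched as follows. The integer state ids, Q-tables, and transitions are hypothetical stand-ins (e.g., for partially generated graphs); a real DQN would use a neural network for Q:

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def td_target(reward, next_q_values_target, gamma, done):
    """y = r + gamma * max_a' Q(s', a'; theta^-), using the frozen target net."""
    if done:
        return reward
    return reward + gamma * max(next_q_values_target)

def td_loss(q_sa, target):
    """Squared TD error (y - Q(s, a; theta))^2 for one sampled transition."""
    return (target - q_sa) ** 2

rng = random.Random(0)

# Hypothetical replay buffer of (s, a, r, s', done) transitions.
replay = [(0, 1, 1.0, 1, False), (1, 0, 0.0, 2, True)]
s, a, r, s_next, done = rng.choice(replay)   # uniform sample ~ U(D)

q_online = {0: [0.2, 0.5], 1: [0.1, 0.3], 2: [0.0, 0.0]}   # Q(.; theta_i)
q_target = {0: [0.2, 0.4], 1: [0.1, 0.2], 2: [0.0, 0.0]}   # Q(.; theta_i^-)

y = td_target(r, q_target[s_next], gamma=0.9, done=done)
loss = td_loss(q_online[s][a], y)
```

Note that the target is computed with the frozen parameters `q_target` while the loss is taken against the online network `q_online`, mirroring the θ_i^− / θ_i split in the loss function above.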

APPENDIX B BENCHMARK RESULTS AND DATASETS
As deep graph generation is a relatively new research area, it is important to quantitatively compare the performance of existing algorithms and to provide unified benchmark datasets for future algorithms and research. In this section, we first summarize the existing benchmark datasets used to evaluate existing models. Next, we compare the published results of existing deep generative models using the evaluation metrics introduced in Section 5.

B.1 Datasets
The existing benchmark datasets typically used in this domain can be categorized into synthetic datasets and real-world datasets. We have collected and published all the datasets at this link: https://github.com/xguo7/Dataset-for-Deep-Graph-Generation.

B.1.1 Synthetic Datasets
The following synthetic graph datasets are typically used in existing methods.
Community. It contains 500 two-community graphs with 60 ≤ |V| ≤ 160, where V denotes the node set of a graph. Each community is generated by the Erdős-Rényi (E-R) model [15] with n = |V|/2 nodes and a link probability of p = 0.3.
Grid. It contains 100 standard 2D grid graphs with 100 ≤ |V| ≤ 400 and 100 large standard 2D grid graphs with 1296 ≤ |V| ≤ 2025.
(Footnote 3: We only consider unconditional graph generation in this section due to the small number of existing conditional graph generation methods.)
B-A. It contains 500 graphs with 100 ≤ |V| ≤ 200 generated using the Barabási-Albert model. During generation, each node is connected to 4 existing nodes.
Cycles. A synthetic dataset of graphs with cyclically connected nodes; each graph is a path with its two end nodes connected. 500 graphs are generated with sizes 10 ≤ |V| ≤ 100.
Trees. A synthetic dataset of 500 trees (10 ≤ |V| ≤ 100) with power-law degree distributions. To generate a tree, a trial power-law degree sequence is chosen, and elements are swapped with new elements from a power-law distribution until the sequence makes a tree.
Ladder. A synthetic dataset of 180 ladder graphs with 10 ≤ |V| ≤ 100. A ladder graph consists of two paths of |V|/2 nodes each, with each pair of corresponding nodes connected by a single edge.
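For reference, the sketch below generates a graph in the spirit of the Community dataset, using only the standard library. The intra-community link probability p = 0.3 comes from the dataset description above; the inter-community link probability of 0.05 is an assumption of this sketch, not a value stated in the survey:

```python
import random

def erdos_renyi_edges(nodes, p, rng):
    """E-R model: include each possible edge independently with probability p."""
    return [(u, v) for i, u in enumerate(nodes)
            for v in nodes[i + 1:] if rng.random() < p]

def two_community_graph(n, p_intra=0.3, p_inter=0.05, seed=0):
    """Two E-R communities of n//2 nodes each, joined by sparse random
    inter-community edges (p_inter is a hypothetical choice)."""
    rng = random.Random(seed)
    half = n // 2
    c1, c2 = list(range(half)), list(range(half, n))
    edges = erdos_renyi_edges(c1, p_intra, rng) + erdos_renyi_edges(c2, p_intra, rng)
    edges += [(u, v) for u in c1 for v in c2 if rng.random() < p_inter]
    return list(range(n)), edges

nodes, edges = two_community_graph(60)  # smallest size in the Community dataset
```

The same pattern extends directly to the other synthetic datasets (grids, cycles, ladders), which are deterministic constructions rather than random models.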

B.1.2 Real-world Datasets
The following real-world datasets are typically used in existing methods.
QM9 [146]. It is an enumeration of around 134k stable organic molecules with up to 9 heavy atoms (carbon, oxygen, nitrogen and fluorine). As no filtering is applied, the molecules in this dataset only reflect basic structural constraints.
ZINC [147]. This dataset is a curated set of 250k commercially available drug-like chemical compounds. On average, these molecules are bigger (about 23 heavy atoms) and structurally more complex than the molecules in QM9.
CEPDB [148]. This dataset consists of organic molecules with an emphasis on photo-voltaic applications. The contained molecules have 28 heavy atoms on average and contain six to seven rings each.
Protein [149]. This dataset contains 918 protein graphs with 100 ≤ |V| ≤ 500. Each protein is represented by a graph, where nodes are amino acids and two nodes are connected if they are less than 6 Angstroms apart.
Enzymes [150]. This dataset contains protein tertiary structures representing 600 enzymes. Nodes in a graph (protein) represent secondary structure elements, and two nodes are connected if the corresponding elements are interacting. The node labels indicate the type of secondary structure, which is either helices, turns, or sheets.
Citation graphs [145]. Cora and Citeseer are citation networks whose nodes correspond to publications and whose edges represent one paper citing another. Node labels represent the publication area. The Cora dataset contains 2708 nodes, 5429 edges, 7 classes, and 1433 features per node; the Citeseer dataset contains 3327 nodes, 4732 edges, 6 classes, and 3703 features per node.

B.2 Results
To compare the different deep generative models on graphs, we summarize their published experimental evaluation results on two main graph generation tasks. One category is domain-agnostic, where general graph generation tasks are conducted without incorporating domain knowledge to guarantee special properties of the graphs, such as synthetic graph generation. The other category is domain-aware, such as molecule structure generation, which requires considering the validity and properties of the generated nodes, edges, and whole graphs.

TABLE 4: Quantitative evaluation and comparison on general graph generation tasks by different deep generative models on graphs (N refers to the number of nodes in the graph; "D." refers to the MMD for node degree, "C." to the MMD for clustering coefficient distributions, and "O." to the MMD for average orbit count statistics; "-" denotes that published results are unavailable in the original papers).

TABLE 5: Quantitative evaluation and comparison on molecule structure generation tasks by different deep generative models on graphs (C refers to the number of motifs and N to the number of nodes in the graph; "Unique." is short for uniqueness, "Novel." for novelty, and "Valid." for validity).

B.2.1 Results on Domain-agnostic Graph Generation
The aforementioned benchmark datasets are widely used to evaluate and compare the existing methods. Ten state-of-the-art methods, covering both one-shot and sequential generating styles, are compared in Table 4 on the three most frequently used datasets: Community, Ego, and Protein (we categorize Protein as domain-agnostic since it is commonly used without considering the protein-specific properties of the graphs). To evaluate the quality of the generated graphs, the Maximum Mean Discrepancy (MMD) [18] metric is utilized to measure the distance between the learned graph distribution and the real graph distribution with respect to node degree, clustering coefficient, and average orbit count statistics, as described in Section 5. In addition, we compare the time complexity of these models to reflect their efficiency. In the following, we analyze the experimental results by focusing on a few key issues.
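The MMD metric can be sketched for a single 1-D graph statistic, such as a node-degree sample. The version below is a simplification (the benchmark implementations compare sets of per-graph distributions and use different kernels, and the degree values here are hypothetical); it estimates MMD² = E[k(x,x′)] + E[k(y,y′)] − 2E[k(x,y)] with an RBF kernel:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """RBF kernel matrix: k(x, y) = exp(-(x - y)^2 / (2 sigma^2))."""
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / (2 * sigma**2))

def mmd_squared(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two 1-D samples of a
    graph statistic (e.g., node degrees): always >= 0, and 0 iff the
    empirical kernel mean embeddings coincide."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

# Degrees from a hypothetical real graph vs. a generated one.
deg_real = np.array([2.0, 3.0, 3.0, 4.0, 2.0])
deg_gen = np.array([2.0, 2.0, 3.0, 5.0, 4.0])
score = mmd_squared(deg_real, deg_gen)   # lower is better
```

A lower score indicates that the generated statistic distribution is closer to the real one, which is how the "D.", "C.", and "O." columns of Table 4 are read.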
Comparison between sequential-based and one-shot-based graph generation for domain-agnostic graphs. Based on the experimental results, we observe that sequential-based methods, especially RNN-based models, deliver better performance than one-shot models on many benchmark datasets. As shown in Table 4, GraphRNN and GraphVRNN achieve the best performance on the Community and Ego datasets, with GraphVRNN attaining the lowest average MMD of 0.025 on the Community dataset, compared with an average MMD score of 0.246 for the other methods. On the Ego dataset, GraphRNN delivers the lowest average MMD of 0.017, while the MMD scores of the other methods are around 0.060. This may be because sequential-based generation is better at modeling the complex dependencies among the nodes and edges in a graph, as the conditional distribution of each node or edge is modeled given the partially generated graph.
Influence of the attention mechanism on graph generation. The comparison results suggest that the attention mechanism supports better learning of the graph generation process. As shown in Table 4, among the sequential-based graph generation methods, the attention-based recurrent neural networks GRAN and GRAN-I turned in the best performance on the Protein dataset, with the smallest average MMD scores of 0.063 and 0.046, respectively. This is because the attention mechanism helps distinguish multiple newly added nodes and learns different attention weights for different types of edges during the generation process, thus delivering more powerful learning capability.
Experimental comparisons of complexity. Based on the complexity results shown in Table 4, most graph generation methods have a complexity of O(N^2). In sequential-based graph generation methods, it is possible to improve the scalability of the generation model from O(N^2) to O(N · |E|) by implementing a permutation-invariant strategy, as in GRAN [49]. This is because the complexity challenge of graph generation arises primarily in the node permutation step when calculating the loss function for optimization. In the case of one-shot generation, each graph is represented by its adjacency matrix, which fixes the complexity at O(N^2).

B.2.2 Results on Domain-aware Graph Generation
Among the domain-aware graph generation tasks, molecule generation is the most widely explored problem and is handled by a large number of works. Thirteen domain-specific models that have published molecule generation results on the two most frequently used benchmark datasets (i.e., QM9 and ZINC) are compared in Table 5. Unlike domain-agnostic graph generation, the domain-aware graph generation task places the most value on the validity of the generated graphs, as well as their diversity and novelty, given the requirements of novel structure design. Thus, three metrics, namely uniqueness, novelty, and validity, are utilized. In the following, we again analyze the experimental results by focusing on a few key issues.
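The three metrics can be sketched as below. Exact definitions vary slightly across papers (e.g., whether uniqueness is computed over valid molecules only; the convention chosen here is one common variant), and the validity check in the toy run is a stand-in for a chemistry toolkit such as RDKit:

```python
def molecule_metrics(generated, training_set, is_valid):
    """Validity = |valid| / |generated|; uniqueness = |unique valid| / |valid|;
    novelty = fraction of unique valid graphs absent from the training set."""
    valid = [g for g in generated if is_valid(g)]
    unique = set(valid)
    validity = len(valid) / len(generated)
    uniqueness = len(unique) / len(valid) if valid else 0.0
    novelty = (len(unique - set(training_set)) / len(unique)) if unique else 0.0
    return validity, uniqueness, novelty

# Toy run with SMILES-like strings; the is_valid lambda is a hypothetical
# placeholder for a real chemical-validity check.
train = ["CCO", "CCC"]
gen = ["CCO", "CCO", "CCN", "XX"]
validity, uniqueness, novelty = molecule_metrics(
    gen, train, is_valid=lambda s: "X" not in s)
```

In this toy run, 3 of 4 generated strings are valid, the valid ones cover 2 distinct molecules, and one of those 2 ("CCN") does not appear in the training set.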
Influence of domain-specific knowledge on molecule generation. Table 5 shows the experimental results for both general graph generation techniques (e.g., GraphVAE, GRF, GraphNVP) and domain-specific graph generation techniques (e.g., JT-VAE and GCPN) on the molecule generation/optimization problem. Based on these results, incorporating domain-specific knowledge, in the form of either learning rewards or regularization, helps to generate more valid and unique graphs. For example, JT-VAE and GCPN show better performance than the other methods on the ZINC dataset, especially in terms of uniqueness and validity. Specifically, on the ZINC dataset, JT-VAE is the only method that achieves 100% uniqueness, a performance about 26.41% higher than that of the other methods on average. This is because the inclusion of domain-specific knowledge allows the direct optimization of application-specific objectives while still ensuring that the generated molecules are realistic and satisfy chemical rules.
Experimental comparison of complexity on molecule generation. As shown in Table 5, the motif-sequential-based graph generation methods deliver the most efficient generation process with the lowest complexity. For example, JT-VAE and GCPN achieve higher scalability (i.e., O(C)) than other models by utilizing a motif-based sequential generating technique. This is because the number of generation iterations is reduced considerably by decomposing all of the nodes into several motif groups, reducing the complexity to O(C), where C refers to the number of motifs.
Experimental comparison of sequential-based and one-shot graph generation for molecule generation. As shown in Table 5, methods based on node-sequential and motif-sequential generating techniques, such as CGVAE and MolecularRNN, are also powerful ways to generate molecular structures with high uniqueness, novelty, and validity compared to the one-shot generation methods. For example, on the ZINC dataset, the average uniqueness and validity achieved by the sequential-based methods are 81.99% and 86.16%, respectively, which are 12.34% and 26.84% higher than those of the one-shot-based methods. Sequential generating techniques, especially motif-based sequential techniques, are more effective in molecule generation tasks for two main reasons: (1) sequential-based methods are capable of modeling the distribution of graph sizes (i.e., the number of nodes), which vary naturally; and (2) generating a graph from given motifs (i.e., at a coarse-grained level) lowers the risk of producing invalid results compared to generating a graph from nodes and edges (i.e., at a fine-grained level).