Evolutionary job scheduling with optimized population by deep reinforcement learning

The sorting operation of the production line in a heavy industrial scenario exhibits twofold complexity, in both the problem and the data. To improve production efficiency, the operation needs to be optimized. Aimed at this problem, this article designs a data representation method and an evolutionary job scheduling algorithm with a population optimized by deep reinforcement learning (DRL). Moreover, a real industrial dataset is contributed. The representation method represents the job data by referring to the bag-of-words model. The evolutionary algorithm uses DRL to initialize the genetic algorithm (GA)'s population and further evolves the population through the GA to obtain the final scheduling result. The experimental results indicate that the evolutionary algorithm achieves the largest decrease in the average number of frame clearings on the real and simulated validation datasets, 12.54% and 11.43%, respectively. It is of great significance for subsequent scheduling of the full-scenario digital twin.


Introduction
With the slowdown of economic growth and the decline of the demographic dividend, it has become an inevitable trend to improve the production efficiency of production lines and the competitiveness of manufacturing through intelligent technology (Hu et al. 2021). This work focuses on production line steel plate sorting sequence optimization in heavy manufacturing industry, especially vehicle machinery manufacturing. As a real industrial scheduling problem, if each steel plate to be sorted is regarded as a node, the scheduling problem can be abstracted as an unclosed travelling salesman problem (TSP) when the optimal sorting sequence of plates needs to be solved. Note that the sorting sequence is one of the manifestations of scheduling in this environment, and a better sequence can lead to a better objective value. Solution of the above scheduling problem not only needs to optimize the sorting sequence but also has strict requirements on the time cost of the solving process.
Frame clearing occurs when sorting the steel plates, and each clearing operation requires plenty of transit time, while a reasonable plate sorting sequence can reduce the number of frame clearings, thereby reducing the transit time. To facilitate the study of plate sorting sequence optimization, the plate data should be represented abstractly. However, to the best knowledge of the present authors, formal modelling and public datasets on the sorting sequence optimization of steel plates in the heavy manufacturing industry are lacking. An appropriate optimization algorithm is also needed to solve for a better sorting sequence at different data scales.
Meta-heuristic algorithms are widely used in the optimization field. The advantage of the traditional meta-heuristic algorithm is that it is suitable for low-dimensional data and can find a reliable solution in a reasonable time (Leu and Hwang 2001; Sebt, Afshar, and Alipouri 2017; Feng et al. 2019; Yu et al. 2018). However, its disadvantage is that it needs to be combined with other methods, such as statistical methods, to deal with large-scale scheduling problems. For example, J. Zhang and Wang (2020) introduced random-walk Lévy flight into the traditional meta-heuristic algorithm, and their experimental results showed that the improved algorithm could solve high-dimensional large-scale optimization problems effectively, but its robustness was not strong. Cao, Lin, and Zhou (2019) combined the meta-heuristic algorithm with external knowledge bases to solve large-scale optimization problems. However, external knowledge bases are often lacking in real scenarios.
The training of deep reinforcement learning (DRL) does not require labelled data, and the trained policy network only needs forward calculation, which saves a great deal of computing time. Although DRL generalizes well to large-scale scheduling problems, it is not ideal for input data with different dimensions. How to achieve good performance on scheduling data with different dimensions is a challenging task. Based on the above motivation, this work combines DRL with a traditional meta-heuristic algorithm to solve job scheduling problems with different data dimensions.
The main contributions of the present work are as follows.
(1) In a scenario of steel plate sorting, this work formally models the problem by abstractly representing the complex plate data using the bag-of-words model. (2) For scheduling data of different scales, a new evolutionary algorithm (EA) is proposed that initializes the population of the genetic algorithm (GA) with the solution results of policy networks trained by DRL, and then uses the GA to improve the initial results further. (3) The real production data of a production line subordinate to a construction machinery manufacturing enterprise are sorted out; after desensitization, a real production scheduling dataset is formed and released for use by other researchers.
The remainder of the article is organized as follows: Section 2 describes related work. Section 3 presents the problem definition and data representation. Section 4 presents the methodologies. Section 5 gives the experimental results and analysis. Section 6 lists the conclusions.

Related work
The meta-heuristic method is introduced into the problem of job scheduling optimization (Saragih et al. 2019; Bitam, Zeadally, and Mellouk 2018) because of its small amount of calculation and easy implementation. It mainly solves optimization problems by imitating biological or physical phenomena (Lim et al. 2009; Mirjalili and Lewis 2016). A common feature of these algorithms is that the search process is divided into two stages: exploration and exploitation. How to find a suitable balance between exploration and exploitation is a difficulty in the application of any meta-heuristic algorithm. Reinforcement learning (RL) is a kind of sequential decision-making method (Kaelbling, Littman, and Moore 1996). With the rise of deep learning, DRL has shown more powerful representation and generalization ability and has been widely used in combinatorial optimization (L. Liu et al. 2016; Cho et al. 2014).
Simulated annealing (SA) and GAs are two classical types of meta-heuristic. SA obtained the optimal solution on the FT10 standard scheduling problem (Yamada and Nakano 1996). However, when the dimension of the input data is large, experimental results show that traditional meta-heuristics (such as GAs and SA) are likely to fall into local optima owing to the deterioration of the dimensionality, and their convergence speed becomes slower and slower (Wu and Wang 2021; T. Zhang et al. 2021).
Sequence-to-sequence modelling, which maps one sequence to another, has been a widely used technology in recent years. In their experiments, Bahdanau, Cho, and Bengio (2015) used the attention mechanism to improve the sequence-to-sequence model and produced better results. Because of its practicability, the attention mechanism has been widely used in computer vision (X. Liu et al. 2019; Wang et al. 2018), machine translation (Luong, Pham, and Manning 2015), speech recognition (Park et al. 2019; Ravanelli, Parcollet, and Bengio 2019) and other fields.
More and more research workers have applied the sequence-to-sequence structure with the attention mechanism to combinatorial optimization. The first to do so were Vinyals, Fortunato, and Jaitly (2015), who used a variant of the sequence-to-sequence model called the 'Pointer network'. They used a supervised method to train the network, but the labelled training data came from a heuristic algorithm, so it was difficult for the network to outperform the corresponding heuristic. Bello et al. (2016) used RL to train a Pointer network. Compared with Vinyals, Fortunato, and Jaitly (2015), RL brought more exploration space to the strategy model. Inspired by the popular transformer network in natural language processing, Kool, Van Hoof, and Welling (2018) improved the transformer's structure for combinatorial optimization problems and trained the network with RL. Their optimal result was obtained on TSP with 100 nodes. Note that, since the Pointer network was proposed in 2015, it has been possible to use neural networks to solve combinatorial optimization problems (Kaiwen, Tao, and Rui 2020).

Figure 1 shows the sorting sequence scheduling optimization problem in a manufacturing production line. There are N pieces of steel plate that need to be sorted, and each plate contains several parts. The typical distribution of real parts in the plate can be found in Figure 1 of the online supplemental data, which can be accessed at https://doi.org/10.1080/0305215X.2021.2013479. The parts need to be placed in the material frame by the mechanical arm. To meet industrial standardization process requirements, only parts of the same type can be stacked in the frame. The placement area in the frame is divided into T areas in advance, and the part types placed in each area are different. This means that a frame can load at most T different types of part, and the number of stacking layers for the same part is not more than L.
The corresponding schematic diagram of stacking rules is shown in the front and top views of the material frame in Figure 1.

Problem definition
The input data x are the N steel plates with their sorting sequence, recorded as [··· P_{i−1} P_i ··· P_{j−1} P_j ···], where P represents the data on a steel plate, i, j ≤ N and i ≠ j, with i, j ∈ N+. V_θ denotes the number of stacking layers for the same part in area θ of the current frame, and θ is the type number of parts in the current frame. The function F takes x as input, sorts the set of steel plates in the given sequence according to the stacking rules, and outputs the resulting number of frame-clearing times.
For example, if one plate to be sorted contains L/2 pieces of parts of a certain type, and the next plate to be sorted contains L/2 + 1 pieces of parts of the same type, then the number of stacking layers for parts in a certain area in the frame will exceed L, which will trigger frame clearing. In another case, if the previous plate contains T/2 different types of single part and the next plate contains T/2 + 1 different types of single part, the loading area delimited in advance in the frame will be fully occupied, and frame clearing will also be triggered.
The definitions of symbols related to the optimization problem in this article are given in Table 1; x means the data on N steel plates with their sorting sequence. Once θ > T or V_θ > L during plate sorting, the frame must be cleared. Frequent frame clearing requires repeated scheduling of the automated guided vehicle, which consumes a lot of manpower and material resources. In general, the planning of an intelligent factory should minimize ineffective material handling. Therefore, it is necessary to reduce the frequency of frame clearing by implementing a better sorting sequence of plates. The optimization objective and constraints of the problem are described by Equations (1) and (2):

min F(x)    (1)

subject to θ ≤ T, V_θ ≤ L    (2)

The optimization objective is minimizing F(x), the number of frame-clearing times. Its mathematical definition is recursive. The value of F(x) is zero at initialization, when the sort operation has just begun, and F(x) is increased by one each time frame clearing is triggered, i.e. whenever the frame cannot continue loading incoming parts. When all the plates are sorted, the value of F(x) no longer changes. The constraints in Equation (2) express the stacking rules of parts in a frame. In this work, T = 12 and L = 10.
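To make the recursive definition of F(x) concrete, the following minimal sketch counts frame clearings for a sequence of plates. It assumes each plate is given as a mapping from part type to piece count; the function name and data layout are illustrative, not the authors' implementation.

```python
def frame_clearings(plates, T=12, L=10):
    """Count how many times the frame must be cleared while sorting `plates`
    in the given order. The frame holds at most T distinct part types and
    at most L stacked pieces per type (the stacking rules of Equation (2))."""
    clearings = 0
    frame = {}  # part_type -> number of stacked layers currently in the frame
    for plate in plates:
        for part, k in plate.items():
            remaining = k
            while remaining > 0:
                is_new_type = part not in frame
                # Frame clearing: a new type would exceed the T areas, or the
                # stack for this type is already at the L-layer limit.
                if (is_new_type and len(frame) >= T) or frame.get(part, 0) >= L:
                    frame = {}
                    clearings += 1
                space = L - frame.get(part, 0)
                placed = min(remaining, space)
                frame[part] = frame.get(part, 0) + placed
                remaining -= placed
    return clearings
```

With this helper, both triggering cases from the example above reproduce exactly one clearing: L/2 then L/2 + 1 pieces of one type overflow the L layers, and T/2 then T/2 + 1 distinct types overflow the T areas.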

Data representation
Because a plate contains several parts and there are many kinds of part in the production line, this work draws on the bag-of-words model for the abstract representation of the plate data. Firstly, a fixed-order part dictionary for all parts is created, which has D items in total. Then a D-dimensional all-zero vector corresponding to the order of parts in the dictionary is created. When a plate needs to be represented, the parts contained in the plate are traversed; if the number of parts of a certain type is k, the number in the corresponding position in the vector is increased by k. The final D-dimensional vector is used as the original representation of the plate. As shown in Figure 2, the parts in the plate P_1 are traversed, and when a part is accessed, the number in the corresponding position in the vector is increased by one. Meanwhile, for the convenience of research, this work stipulates that the sorting rule of the manipulator is that parts are sorted according to their order in the dictionary until all parts of one type are sorted, after which the next type of part is sorted. This sorting rule also fits the layout knowledge that parts of the same type tend to be arranged together when the steel plate is nested, so as to maximize space utilization in the plate, and it helps to reduce repeated movement of the manipulator during sorting.
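The representation above can be sketched in a few lines. The part names in the dictionary here are illustrative stand-ins, not taken from the real dataset.

```python
def build_part_index(part_dictionary):
    """Fix an order for all D part types once, up front."""
    return {part: idx for idx, part in enumerate(part_dictionary)}

def plate_vector(plate_parts, part_index):
    """Traverse the parts on one plate; every piece encountered adds 1 to
    the position of its type (so k pieces of one type add k in total)."""
    vec = [0] * len(part_index)
    for part in plate_parts:
        vec[part_index[part]] += 1
    return vec
```

For instance, with a three-entry dictionary, a plate carrying two bolts and one panel maps to the count vector [2, 0, 1].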

Evolutionary algorithm based on population optimization by DRL
The structure of the EA is shown in Figure 3 (the corresponding pseudo-code can be found in Algorithm 1). Firstly, the policy networks (PointerNet and the transformer) are trained by deep reinforcement learning, and two parameterized policies are obtained. Then, the original plate data are represented by the data representation method described in Figure 2, and the corresponding D-dimensional vectors are input into the trained deep policy networks, yielding two initial solutions. Finally, the two initial solutions are added to the GA's initial population, which is otherwise generated randomly to enhance the population's diversity. The GA thus has higher quality solutions generated by DRL at the start of the search, which reduces the time cost of the exploration stage and lets the GA devote more effort to the exploitation stage, accelerating the convergence of the search process and improving its reliability. The individual with the best objective value in the final population of the GA is regarded as the best sorting sequence.
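The population-seeding step can be sketched as follows, with the two DRL solutions passed in as plain permutations; the function name and signature are ours, not the paper's pseudo-code.

```python
import random

def init_population(n_plates, pop_size, drl_solutions=()):
    """Build the GA's initial population: the DRL solutions (e.g. from the
    trained PointerNet and transformer) first, then random permutations
    to fill the rest and keep the population diverse."""
    base = list(range(n_plates))
    population = [list(s) for s in drl_solutions]
    while len(population) < pop_size:
        individual = base[:]
        random.shuffle(individual)
        population.append(individual)
    return population
```

With no `drl_solutions` this degenerates to the conventional random initialization, which is exactly the baseline the experiments compare against.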

Coding and fitness function
The problem's solution needs to be encoded as an individual in the population of the GA. When the population is initialized, a random permutation of the plates is taken as the initial value of each individual. A detailed explanation of the initialization can be seen in Figure 2(a) of the online supplemental data. The fitness function is F(x).

Operator definition
Selection operator: Firstly, the F(x) of each individual in the population is calculated in turn. Then the reciprocal of the fitness value is taken as the individual's selection weight, so that individuals with smaller F(x) are more likely to be chosen. Finally, a group of individuals with smaller F(x) is selected as evolutionary fathers and mothers by a roulette-wheel algorithm.
Crossover operator: When a new individual is generated by the crossover operator, the first step is to randomly select a starting subscript start and an ending subscript end. Then, the sequence fragment x_f[start:end] in the father x_f is preserved and passed on to the next generation: all genes of x_f[start:end] are deleted from the mother x_m, and the fragment is inserted at the start position of x_m, forming a new individual. See Figure 2(b) in the online supplemental data for a detailed example.
Mutation operator: A swap mutation is used. Two different positions i and j in the individual x are selected randomly, and the genes at these positions are exchanged.
After the initial population is generated, genetic operations such as selection, crossover and mutation are performed continuously. The search process ends when the maximum number of iterations has been executed or the optimal fitness value has reached the convergence state (defined as the optimal individual in the population not improving for five consecutive generations). Otherwise, the genetic operations continue.
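The three operators described above can be sketched as follows; fitness is F(x) and smaller is better, so selection weights individuals by the reciprocal of fitness. This is an illustrative sketch, not the authors' implementation (e.g. the paper inserts the fragment at the start of the mother, which here means prepending it).

```python
import random

def select_parent(population, fitnesses):
    # Roulette-wheel selection on reciprocal fitness (+1 guards against 1/0).
    weights = [1.0 / (f + 1.0) for f in fitnesses]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(father, mother):
    # Preserve father[start:end]; delete those genes from the mother and
    # prepend the fragment, so every plate still appears exactly once.
    start, end = sorted(random.sample(range(len(father) + 1), 2))
    fragment = father[start:end]
    rest = [g for g in mother if g not in fragment]
    return fragment + rest

def mutate(individual):
    # Swap mutation: exchange the genes at two random distinct positions.
    i, j = random.sample(range(len(individual)), 2)
    individual[i], individual[j] = individual[j], individual[i]
    return individual
```

Note that both operators preserve the permutation property: the child of `crossover` and the output of `mutate` are always valid sorting sequences containing each plate once.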

RL
The advantage actor-critic (A2C) framework is used for training in DRL. PointerNet or the transformer network is used as the actor for decision-making, and a relatively simple model is used as the critic to guide the actor's parameter updates.
RL can be abstracted as a Markov decision process, represented by the four-tuple ⟨s, a, r, p⟩, where s is the state, a is the action, r is the reward given by the environment and p is the probability of state transition. The state s is defined as all plate data together with the serial numbers of the plates selected at all previous decision-making times. The initial value of r is zero; when the steel plate corresponding to the action is processed, if the number of frame-clearing times increases by one, the reward r is decreased by one. The policy is denoted π, with p = π(s, a).
The calculation of the actor's loss function is shown in Equation (3), where n is the number of training samples and A^π is the advantage function, whose definition is shown in Equation (5). Equation (6) defines the action-value function under policy π: the action-value function Q^π(s, a) is the reward r obtained after taking action a in the current state s plus the value function of the next state s′ discounted by γ, while Equation (7) defines the value function of the state.
It can be seen that V^π is the expected value of Q^π. The calculation of the critic's loss function is shown in Equation (4) (Barto, Sutton, and Anderson 1983). The term in brackets is the temporal-difference error, and the sum of squares of this error is the critic's loss value.
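A minimal numeric sketch of these losses, assuming the common one-step A2C form in which the advantage is approximated by the temporal-difference error r + γV(s′) − V(s); the function name and batch layout are ours, and the inputs are illustrative numbers rather than real network outputs.

```python
import numpy as np

def a2c_losses(log_probs, rewards, values, next_values, gamma=0.99):
    """log_probs: log pi(a|s) for each transition; values/next_values: the
    critic's V(s) and V(s'). Returns (actor_loss, critic_loss)."""
    log_probs = np.asarray(log_probs, dtype=float)
    # TD error, used as the advantage estimate A = r + gamma*V(s') - V(s).
    td_error = (np.asarray(rewards, dtype=float)
                + gamma * np.asarray(next_values, dtype=float)
                - np.asarray(values, dtype=float))
    actor_loss = -np.mean(td_error * log_probs)   # policy-gradient loss (Eq. (3) style)
    critic_loss = np.mean(td_error ** 2)          # squared TD error (Eq. (4) style)
    return actor_loss, critic_loss
```

In practice the gradient of `actor_loss` is taken with respect to the actor's parameters only, with `td_error` treated as a constant, while `critic_loss` updates the critic.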

Deep neural network
Because many components in the original D-dimensional vector are zero, the vector is relatively sparse, while each neuron in a fully connected layer applies a linear transformation and a nonlinear activation function that can map zeros to non-zero values and reduce data sparsity. This work therefore uses a fully connected layer with M dimensions (M < D) to obtain an M-dimensional dense vector as the embedding representation, which is the input to the decision network. The embedding layer and the decision network are trained jointly. The decision network PointerNet is a typical encoder-decoder architecture. As shown in Figure 4, the encoder and decoder each use a single-layer LSTM. Simultaneously, the attention mechanism is modified so that the output at each decoding step is the probability vector that each input plate may be selected. The probability vector has dimension N, the same as the length of the encoder's input sequence, which solves the problem of a fixed-length output vector. The attention mechanism in the Pointer network can be briefly described as

u_j^i = v^T tanh(W_1 e_j + W_2 d_i), j = 1, ..., N,
P(C_i | C_1, ..., C_{i−1}, P_all) = softmax(u^i),

where e_j is the encoder's hidden output at the jth time step, d_i is the decoder's hidden output at the ith time step, and u^i = [u_1^i, u_2^i, ..., u_N^i]. Finally, softmax operates on u^i to obtain the probability vector of all plates being selected. P_all represents all the plate data, and C_i is the serial number of the plate selected at time i; v^T, W_1 and W_2 are trainable parameters of fixed dimension.
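One decoding step of the pointer attention can be sketched numerically as follows; the shapes and random inputs are illustrative, and in the real network E and d come from the LSTM encoder and decoder.

```python
import numpy as np

def pointer_attention(E, d, W1, W2, v):
    """One pointer step: u_j = v^T tanh(W1 e_j + W2 d_i), then softmax over
    the N encoder outputs, so the distribution always has the input length N.
    E: (N, h) encoder hidden states; d: (h,) decoder hidden state;
    W1, W2: (h2, h) projection matrices; v: (h2,) scoring vector."""
    u = np.tanh(E @ W1.T + d @ W2.T) @ v   # (N,) unnormalized scores
    u = u - u.max()                        # shift for numerical stability
    p = np.exp(u) / np.exp(u).sum()        # softmax: probability per plate
    return p
```

Because the softmax runs over the encoder positions themselves, the output distribution automatically matches N, which is exactly why the Pointer network handles variable-length inputs.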
The transformer is also divided into two parts: the encoder and the decoder. A detailed description can be seen in the online supplemental data.

Experimental results
The experimental data come from a smart factory affiliated to a construction machinery manufacturing enterprise. Based on statistics of the part types on the production line, there are 226 kinds of part, which means D = 226. Each plate selects several parts from the parts library for shape cutting and sorting. This work collects plate data during a certain period of time on the production line and sorts a total of 150 real plate data, including all types of part. Meanwhile, to supplement the experimental data further, 200 simulated steel plate data are generated by randomly selecting parts. The generality of the proposed algorithm is verified by using simulated and real data together. The link to the open-source steel plate dataset sorted in the present work is https://github.com/tiantianhuanle/Sequential-optimization. The platform for the experiment was an Intel Core i7-8565U @ 1.8 GHz processor with 8 GB of memory, no GPU acceleration, the Windows 10 operating system and the Python 3.8 programming language.

Table 3 shows the effects of different methods on solving real data of different sizes. The input sizes of the validation datasets range from 10 to 120, and each validation dataset contains 2048 instances. Owing to space limitations, for test results on the simulated validation dataset see Table 2 in the online supplemental data, which can be accessed at https://doi.org/10.1080/0305215X.2021.2013479. The index to be solved is F(x). The percentages in the table are the rates of decrease of the corresponding method compared with the Random method. A down arrow means that the value is smaller than the first value in the same column, and vice versa, while items in boldface are the optimal values in the same column. The parameter settings of the related methods are shown in Table 2.
Effects of different methods

Two points can be inferred from the tables: (1) PointerNet and the transformer trained on datasets with different N both perform well, indicating that DRL has an advantage in generality; (2) when the input size of the validation data is small, the meta-heuristic algorithm performs better, but as N increases, DRL starts to outperform the meta-heuristic algorithm. The EA, however, achieves the best feasible solutions found for all N, which demonstrates its advantages. In addition, the EA has a stronger optimization ability than DRL alone, which shows that the GA further improves the performance of DRL. In the EA, the two modules complement each other.

Figure 5 shows the sorting process of a steel plate set with N = 10 using a random plate sequence; the final F(x) is ten. Figure 6 shows the sorting process of the same data using the plate sequence given by the EA; the final F(x) is eight.

Notes: The EA is compared with recent algorithms, GPM (J. Liu and Li 2018) and CGA (Bye et al. 2021), in the optimization of the sorting sequence. Similar to the present method, GPM uses a greedy algorithm to initialize all GA populations, so that the GA has high-quality solutions in the initial stage of the search, while CGA combines a variety of effective crossover and mutation operators for sequence problems. The EA still achieves the best performance. Although the optimization results obtained by GPM are close to those of the present algorithm, GPM has a higher time cost because it uses a greedy algorithm to initialize the population. The termination condition of SA is that the maximum number of iterations has been executed. The termination conditions of the GA, GPM, CGA and the EA are the same: the maximum number of iterations has been executed or the optimal fitness value has reached the convergence state (defined as the optimal individual in the population not improving for five consecutive generations).
Figures 7, 8 and 9 show the trends of three indicators of the deep networks during RL training, for which the input size of the training data is ten. It can be seen from the changes of the three indicators in these figures that the policy network is continuously optimized and converges, and its performance on the real validation dataset also proves the network's generality.

A2C
The total training time is about 91,680 s. The 50 steps correspond to one epoch, and the average time of each epoch is 91,680/350 ≈ 262 s. It can be seen from Figure 9 that it takes about 200 epochs, 52,400 s in total, to obtain a stable decision network from the A2C training method.

A3C
Owing to the time-consuming training of A2C, this work uses a multi-process, multi-agent asynchronous update method (A3C) (Mnih et al. 2016) to train the network in order to accelerate convergence further. Taking the transformer network as an example, Figures 10, 11 and 12 show the trends of three indicators of the network during training with two and three processes. They show that the network converges more quickly as the number of processes increases.
As can be seen from Figure 12, it takes about 100 epochs to obtain a stable policy network from the A3C, a total of 26,220 s, which shortens the training time by about half.

The influence of different parameters on the GA
To explore the GA in the EA further, the variation curves of the average F(x) of the GA on the real validation dataset (N = 10) are drawn in Figure 13(a) as two parameters of the GA, the mutation probability and the population size, are changed. Note that the average F(x) is continuously optimized as the population size increases. The red curve (with a mutation probability of 0.1) basically achieves the best performance. This shows that the mutation probability does not need to be large, which means the time cost of the exploration stage in the GA should not be too large.

The effect of different methods on the GA population initialization
To explore the influence of different initialization methods on the subsequent GA search, this work studies five initialization strategies: random initialization, using the trained PointerNet, using the trained transformer, using the SA, and using the trained PointerNet and transformer together. Figure 13(b) shows the performance of the GA on the validation dataset (N = 60) under the five initialization strategies. An average F(x) of 44.80 is reached by the final strategy. Compared with the GA alone, the average number of frame clearings is reduced by 1.33 (about 2.88%), which is a considerable improvement.
The GA has better time cost performance after the population is initialized with the trained network or SA. Among them, after using the transformer and PointerNet together to initialize the population, the GA has the lowest time cost. The strategy that uses the SA to initialize the GA population can also produce a good performance, but the time overhead of the SA reaches more than 10 hours, which is extremely time-consuming.
The comparative experiments verify that the EA can achieve the best optimization effect and time cost, which shows that it is helpful in accelerating the convergence of the search when the initial population in the GA has a better solution. Table 4 shows the overall time cost of the conventional GA and the EA in solving real validation datasets with different N. The unit is seconds. The iteration time of each population evolution of the GA is about 3.0 s.

Computational efficiency
When N of the validation dataset is large, such as 110, the time cost of the conventional GA is 4448.63 s, while that of the EA is 3102.73 s, which includes the running time of PointerNet and the transformer in parallel. It can be learnt from the table that the EA has obvious advantages in time cost compared with conventional GA. This is very cost-effective for scenarios with high real-time requirements in large-scale production.
The reason is that, when the scale of the solution data gradually increases, the required time cost also increases significantly owing to the deterioration of the dimensionality. But thanks to the support of DRL, the EA can focus on the exploitation stage, which reduces the overall time cost. Notes: a down arrow indicates that the value is smaller than the first value in the same column, while items in boldface are the optimal values in the same column.

Conclusion
Optimizing sorting sequences can improve production efficiency. This article proposes a data representation method and a new evolutionary optimization algorithm for sorting sequences in heavy industry. The steel plate data are represented by analogy with the bag-of-words model. The EA uses DRL and a GA jointly to optimize the objective function. Related experiments have verified the validity of the data representation and the EA's advantages, which reduce the objective value by 12.54% and 11.43% on the real and simulated validation datasets, respectively. Simultaneously, it also has an advantage over a conventional GA in terms of overall time cost. As a next step, the upstream and downstream of the production line will be studied further, a digital twin system of the full scenario will be constructed and the scheduling strategies will be integrated to maximize production efficiency.