LDCNet: A Lightweight Multiscale Convolutional Neural Network Using Local Dense Connectivity for Image Recognition

Deep convolutional neural networks (DCNNs) have made great contributions to the development of computer vision. Since trained DCNN models require a large amount of computing and storage resources to achieve high performance, it is usually difficult to deploy them on resource-limited systems. To address this problem, we propose a novel local dense connectivity (LDC) module that generates feature maps from cheap convolution operations. The LDC module constructs hierarchical and locally dense connections within a single layer. This construction promotes the reuse of features across network layers; as a result, it generates multiscale feature maps and increases the receptive fields of each network layer. A basic architecture block called LDCBlock is designed based on the LDC module. By stacking this block, we propose a lightweight and efficient multiscale residual network named LDCNet. Moreover, to model the interdependencies between the channels of the LDC module and enhance the informativeness of the diversified features, we also design a sparse squeeze-excitation (SE) module, which has fewer parameters and computations. Finally, experiments on the CIFAR data sets, the ImageNet data set, and a defect detection data set demonstrate that our LDCNet achieves competitive performance compared with state-of-the-art models focusing on model compression.


I. INTRODUCTION
BEING an important type of deep neural network, convolutional neural networks (CNNs) have achieved remarkable success in computer vision, natural language processing, and automated speech recognition, and have also made significant contributions to the development of cognitive systems [1], robots [2], and computational neuroscience [3]. As network structures become deeper [4], [5], [6], better features can be learned to improve their performance. On the ImageNet classification task [7], the classical CNNs ResNet-50 [4] and VGG-16 [8] for image feature extraction and classification possess about 25 million and 138 million parameters, respectively, which require hundreds of megabytes of storage space and hundreds of millions of computing operations. The large numbers of parameters and computations required for their forward propagation limit their use in many application fields. Utilizing convolutional neural network (CNN) models in low-cost environments (e.g., robots [2] and edge devices [9]) is very challenging due to the limited storage capacity, computing power, and battery life. Therefore, our work focuses on building lightweight CNNs that reduce the number of parameters and computations while maintaining good performance.
Recent research [9], [10], [11] has shown that many parameters in CNNs are redundant and do not contribute to improving the accuracy and generalization ability of the model. For example, by discarding some redundant weights, ResNet-50 can reduce over 75% of its parameters and 50% of its calculation time while maintaining normal functionality. However, it is often difficult to maintain the same level of accuracy after compressing the model. Therefore, there is ongoing research on network compression and network architecture design.
Network pruning [12], [13], [14] is an effective technique to avoid over-parameterization in CNNs and is divided into structured pruning and unstructured pruning. Structured pruning techniques eliminate redundant convolution kernels, while unstructured pruning techniques prune redundant connections between neurons. Structured pruning is more hardware-friendly, whereas unstructured pruning requires the support of a sparse matrix library. Mao et al. [15], Liu et al. [16], and Zhu and Gupta [17] induced sparsity in various connection matrices of deep neural networks by removing unimportant connections or parameters, thereby reducing the computational complexity and disk storage of the models.
Quantization techniques [10], [18], [19] use a lower data bit width, such as 8 bit, 4 bit, or even 1 bit, instead of the original 32-bit or 64-bit representation, which reduces the complexity of models and the memory occupied by the parameters.
Knowledge distillation is a network structure optimization method based on transfer learning, in which a teacher network provides supervisory feature guidance that helps train a student network with less computation to reach an accuracy similar to that of the teacher [20].
Different from constructing large-scale networks and compressing the pretrained models to reduce their redundancy, another kind of method focuses on directly designing a compact and efficient network structure. Network design methods have become popular due to their compact size and lower complexity, and they avoid complicated fine-tuning processes. To reduce parameters and computational complexity, we propose a lightweight network architecture, which is mainly composed of local dense connectivity (LDC) modules and sSE modules. As shown in Fig. 1, the proposed LDC module adopts depthwise separable convolutions to obtain abundant feature information within the same layer. The contributions of this article are summarized as follows.
1) An LDC module is proposed by constructing hierarchical dense connections within a single layer to generate multiscale feature maps. The designed module is embedded in the network structure of CondenseNet [21], which demonstrates that our module reduces the parameters and floating-point operations (FLOPs) and improves the performance of the network.
2) By applying linear operations, the connections among channels are changed directly from a dense pattern to a sparse one, and a sparse squeeze-excitation (SE) module, namely, the sSE module, is proposed, which saves parameters and computations. Based on the LDC module and the sSE module, we design a novel basic architectural block named LDCBlock. We also design a lightweight residual network structure called LDCNet, which is constructed by stacking multiple LDCBlocks.
3) The proposed LDCNets are used as the backbone network to extract features at multiple scales. In addition, we utilize the Dilated Encoder [22] and lightweight shared detection heads to improve CenterNet* [23].
The proposed CenterNet-Tiny achieves very competitive performance on the defect detection data set NEU-DET. The remainder of this article is organized as follows. In Section II, related works on efficient network architectures and CNN detectors are introduced. In Section III, we illustrate the proposed network structures, which include the LDC module, sSE module, LDCBlock, and LDCNet. In Section IV, a large number of comparative experiments are presented to verify the effectiveness of the proposed network model. Finally, conclusions and expectations are drawn in Section V.

II. RELATED WORK

A. Efficient Network Architectures
Although network pruning, quantization, and knowledge distillation can effectively compress existing neural networks into lightweight ones, their performance depends on the given pretrained networks, and they cannot further improve accuracy and efficiency without changing the network architecture. In contrast, models built on improved basic operations and architectures take both accuracy and running speed into account, and can achieve a smaller memory footprint and lower energy consumption. Therefore, many recent works prefer to build compact deep neural networks directly. DenseNet [5] connects each layer to every other layer in a feed-forward fashion, which enables the training of deeper networks and improves classification accuracy. To make the calculation in CNNs as efficient as possible, Szegedy et al. [24] adopted factorized convolutions and aggressive regularization in the context of the Inception architecture. Hu et al. [25] proposed the "Squeeze-and-Excitation" (SE) block, which adaptively recalibrates channel-wise feature responses by explicitly modeling interdependencies between channels; this block is often utilized in constructing lightweight CNNs. Howard et al. [26] and Sandler et al. [27] utilized depthwise separable convolutions, inverted residuals, and linear bottlenecks to directly construct the lightweight CNNs MobileNetV1 and MobileNetV2. After that, Howard et al. [28] presented MobileNetV3 based on MobileNetV2 and architecture search techniques. Zhang et al. [29] and Ma et al. [30] designed ShuffleNets by using channel shuffle for group convolutions. Zhang et al. [31] utilized interleaved group convolutions to design IGCNet (IGCV1). By adopting interleaved structured sparse convolutions, IGCV2 was investigated in [32], where the number of groups in each group convolution was chosen to ensure the sparsity of the convolution kernels, keeping a balance between computational complexity and accuracy. Sun et al. [33] extended the structure of IGCV2 and presented a compact IGCV3 by introducing low-rank group convolutions to replace the original group convolutions. Han et al. [34] designed a novel Ghost module to generate more feature maps from cheap operations, which was utilized to establish the lightweight GhostNet. Bai et al. [35] proposed a lightweight statistic attention (SA) block, which can be conveniently inserted into existing deep neural networks. In addition, Huang et al. [21] combined dense connectivity with learned group convolution (LGC) to develop CondenseNet. To increase the reuse efficiency of features, Yang et al. [36] improved CondenseNet by utilizing sparse feature reactivation (SFR) and proposed CondenseNetV2.

B. Defect Detection
Defect detection aims to locate and classify possible defects in an image, which plays a crucial role in ensuring the quality of industrial products during manufacturing. At present, object detection methods such as Faster R-CNN [37], SSD [38], and the YOLO series [39], [40] have been widely adopted in the field of defect detection [41], [42], [43]. However, these traditional anchor-based detectors rely on a set of predefined anchor boxes. A large number of anchors with fixed ratios and scales must be preset to ensure a high recall of detection results, which may affect the calculation speed. Different anchors also need to be set for different data sets, so preprocessing is complicated. Recently, anchor-free detectors have become popular; they are mainly divided into methods based on the joint representation of multiple keypoints [44] and methods based on a center point or region [23], [45], [46]. Law and Deng [44] detected a pair of keypoints, the top-left corner and the bottom-right corner, and then generated an object bounding box through the combination of corner points. In [45], a pixel-level prediction based on fully convolutional networks was proposed to solve the target detection problem. Zhou et al. [46] utilized keypoint estimation to find center points and obtained a bounding box by regressing the size of each object. Anchor-free methods may be more friendly for industrial applications because of their simple network structure. Anchor-free detectors only need to regress the target keypoints and sizes on feature maps of different scales, which greatly reduces the time consumed and the computing power required.

III. PROPOSED METHOD
Considering the tradeoffs among parameters, computational complexity, and classification accuracy, we utilize depthwise separable convolutions to design the LDC module, which generates multiscale low-redundancy feature maps. The LDC module and the sSE module are constructed to further improve the inverted residual block, and a novel network architecture, LDCNet, is proposed.

A. Local Dense Connectivity Module
Chen et al. [47], [48] and Cheng et al. [49] utilized feature maps with different resolutions to improve the ability to extract multiscale features. Gao et al. [50] designed a residual block with hierarchical residual-like connections to obtain multiscale features. Different from the methods in the above references, multiscale ability here means that multiple receptive fields are available within the same output feature layer. Small receptive fields in the output feature maps are beneficial for recognizing small objects and emphasizing object details, while large receptive fields are beneficial for identifying larger objects. A small network composed of two 3 × 3 convolution layers (stride = 1) was adopted in [24] to replace a single 5 × 5 convolution layer, which keeps the receptive field and reduces the number of parameters. DenseNet [5] utilized dense connectivity to facilitate feature reuse in the network and substantially improved parameter efficiency and model accuracy.
Motivated by the above discussions, we propose an LDC module to generate multiscale feature representations at a more granular level. As shown in Fig. 1, the ordinary feature maps with n channels (n = s × w) in convolutional neural networks are split into s groups; the number of scales is s, and each scale has w channels. The first group of feature maps is sent to a convolution with smaller filters, whose output feature maps are connected to the next layer and also form a part of the input feature maps of the following smaller filters in the same layer. Similarly, each group of filters first extracts features from a group of input feature maps, and its output feature maps and another group of input feature maps are concatenated in the channel dimension and sent to the next group of filters. By concatenating feature maps from the preceding groups, different receptive fields can be obtained within a single layer. Due to this combination effect, the smaller filter groups in the same layer level are densely connected, which increases the feature scales of the output feature maps. The input feature maps x are split into s feature map subsets denoted by x_i, where i ∈ {1, 2, ..., s}. The LDC module applies a single layer to convolutional neural networks, which can be described by

y_i = H(x_i), i = 1;  y_i = H(c(x_i, y_(i-1))), 1 < i ≤ s    (1)

where the module output is the concatenation c(y_1, y_2, ..., y_s). H(·) is a composite function of operations including a 3×3 DWConv [26], a 1×1 convolution, BN [51], and ReLU6 [27]. c(·) means that feature maps are concatenated in the channel dimension. The LDC module adopts the 1×1 convolution to compress the number of channels, which can reduce some redundant feature maps.
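To make the construction concrete, the following PyTorch sketch implements an LDC-style module under stated assumptions: each group applies a 3×3 depthwise convolution followed by a 1×1 compression back to w channels, and every group after the first receives its own input split concatenated with the previous group's output. The class name LDCModule and the channel bookkeeping are illustrative choices, not the paper's reference implementation.

# Minimal sketch of an LDC-style module (assumptions noted in the lead-in).
import torch
import torch.nn as nn


class LDCModule(nn.Module):
    """Split the input into s groups of width w and connect them hierarchically."""

    def __init__(self, channels, scales=4):
        super().__init__()
        assert channels % scales == 0, "channels must be divisible by the number of scales"
        self.scales = scales
        w = channels // scales
        blocks = []
        for i in range(scales):
            in_ch = w if i == 0 else 2 * w  # group i > 0 also sees the previous group's output
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # 3x3 DWConv
                nn.BatchNorm2d(in_ch),
                nn.ReLU6(inplace=True),
                nn.Conv2d(in_ch, w, 1, bias=False),  # 1x1 conv compresses channels back to w
                nn.BatchNorm2d(w),                   # no ReLU after the 1x1 conv
            ))
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        splits = torch.chunk(x, self.scales, dim=1)
        outputs, prev = [], None
        for i, xi in enumerate(splits):
            inp = xi if prev is None else torch.cat([xi, prev], dim=1)
            prev = self.blocks[i](inp)
            outputs.append(prev)
        return torch.cat(outputs, dim=1)  # same channel count as the input


if __name__ == "__main__":
    # Quick shape check: 16 channels, s = 4 scales (w = 4 channels per scale).
    y = LDCModule(16, scales=4)(torch.randn(1, 16, 32, 32))
    print(y.shape)  # torch.Size([1, 16, 32, 32])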
In general, the number of feature maps generated by an ordinary convolutional layer is the same as that of the proposed LDC module in (1), so the LDC module can be integrated into existing neural architectures to reduce their size. The FLOPs and parameters are further analyzed when employing the LDC module. Suppose an ordinary feature map has n channels, the number of scales is s, H and W are the height and width of the output feature maps, the number of output channels is c, and the kernel size of the convolution filters is k × k. The theoretical FLOPs and parameters of the LDC module are shown as follows:
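As a rough, back-of-the-envelope illustration of where the savings come from (these are not the paper's equations), the snippet below counts the weights of a standard k × k convolution and of the LDC-style decomposition assumed in the sketch above (per-group k × k depthwise convolution plus a 1×1 compression).

# Illustrative parameter counts; biases are ignored for simplicity.
def ordinary_conv_params(n, c, k):
    return n * c * k * k


def ldc_params(n, s, k):
    w = n // s
    total = 0
    for i in range(s):
        in_ch = w if i == 0 else 2 * w
        total += in_ch * k * k      # depthwise k x k on the group input
        total += in_ch * w          # 1x1 compression back to w channels
    return total


if __name__ == "__main__":
    n, s, k = 160, 4, 3
    print(ordinary_conv_params(n, n, k))  # 230400 for a standard 3x3 conv with n = c = 160
    print(ldc_params(n, s, k))            # far fewer parameters for the LDC-style layout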

B. LDCBlock
Taking advantage of the LDC module, we introduce LDCBlock, which is specially designed for small CNNs. The structure of LDCBlock is illustrated in Fig. 2; it mainly includes a 1×1 convolutional layer and an LDC module. A residual connection is applied between the inputs and the outputs of these convolutional layers in LDCBlock. The 1×1 convolutional layer is used to increase the number of channels. BN and ReLU are applied after each convolutional layer, except that ReLU is not used after the 1×1 convolution in the LDC module, as suggested by MobileNetV2 [27]. The LDCBlock with stride = 1 is shown in Fig. 2(a). As shown in Fig. 2(b), the downsampling layer is implemented by a depthwise convolution with stride = 2 inserted before the first 1×1 convolutional layer.
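A minimal sketch of this block structure is given below, assuming a MobileNetV2-style layout: an optional strided depthwise convolution for downsampling, a 1×1 expansion, the LDC module at the expanded width, and a linear 1×1 projection before the residual addition. The placement of the final projection and the expansion ratio are our assumptions rather than the paper's Table I settings.

# Sketch of an LDCBlock-like inverted residual block (assumptions as noted above).
import torch
import torch.nn as nn


class LDCBlock(nn.Module):
    def __init__(self, in_ch, out_ch, exp_ch, stride=1, ldc_module=None):
        super().__init__()
        layers = []
        if stride == 2:
            # Downsampling variant: depthwise conv with stride 2 before the expansion.
            layers += [nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1, groups=in_ch, bias=False),
                       nn.BatchNorm2d(in_ch)]
        layers += [nn.Conv2d(in_ch, exp_ch, 1, bias=False),      # 1x1 conv increases channels
                   nn.BatchNorm2d(exp_ch),
                   nn.ReLU6(inplace=True)]
        # The LDC module operates at the expanded width; a plain depthwise conv
        # stands in here if no module is supplied.
        layers += [ldc_module if ldc_module is not None
                   else nn.Conv2d(exp_ch, exp_ch, 3, padding=1, groups=exp_ch, bias=False)]
        layers += [nn.Conv2d(exp_ch, out_ch, 1, bias=False),     # linear projection, no ReLU
                   nn.BatchNorm2d(out_ch)]
        self.body = nn.Sequential(*layers)
        self.use_residual = (stride == 1 and in_ch == out_ch)

    def forward(self, x):
        out = self.body(x)
        return x + out if self.use_residual else out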

C. Sparse Squeeze-Excitation Module
This section describes the sSE module in detail (see Fig. 3). SE [25] is a classical channel attention method utilized in CNN architectures, which establishes the interdependence between the channels of feature maps to enhance their representation ability. The SE module obtains a channel descriptor through global average pooling (GAP) to squeeze the spatial dependency, and then uses two fully connected (FC) layers and a sigmoid function to scale the input features and highlight the useful channels. When features are reused in CNNs, redundancy easily arises. To reduce redundant features, a linear operation on each intrinsic feature is used to refresh each feature map. The two FC layers in SE are replaced by a series of linear operations C, so the sSE module is proposed, which is defined as

X_refresh = A_sSE(X_conc) ⊗ X_conc,  with  A_sSE(X_conc) = σ(C(F_gap(X_conc)))

where X_conc ∈ R^(C×W×H) is the concatenated feature map in the LDC module, F_gap(X) = (1/(WH)) Σ_{i,j=1}^{W,H} X_{i,j} denotes GAP, C denotes the cheap linear operations, σ denotes the sigmoid function, and ⊗ denotes elementwise multiplication. As a channel linear operator, A_sSE(X_conc) is applied to the concatenated feature maps X_conc to make them more informative. Finally, the refreshed feature maps X_refresh are obtained. In addition, the sSE module can be plugged into the LDCBlock, as shown in Fig. 2(c).
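The sketch below illustrates the sSE idea: global average pooling followed by a cheap channel-wise map and a sigmoid gate, with a grouped 1×1 convolution standing in for the cheap linear operations C; the paper's exact operator may differ.

# Sketch of a sparse SE gate; the grouped 1x1 conv is a block-sparse stand-in
# for the "cheap linear operations" and has fewer parameters than a dense FC layer.
import torch
import torch.nn as nn


class SparseSE(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.pool = nn.AdaptiveAvgPool2d(1)                    # F_gap: squeeze spatial dims
        self.cheap = nn.Conv2d(channels, channels, 1, groups=groups, bias=True)
        self.gate = nn.Sigmoid()

    def forward(self, x):
        a = self.gate(self.cheap(self.pool(x)))  # per-channel attention in (0, 1)
        return x * a                             # refresh the concatenated features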

D. LDCNet
To construct an efficient architecture, we propose LDCNet, the LDC network, by stacking multiple LDCBlocks. The configuration of LDCNet is illustrated in Table I. Motivated by the architecture of GhostNet [34], the first layer is a standard 3×3 convolution with 16 filters. The stages are defined according to the size of the input feature maps and use the different LDCBlocks in Fig. 2. The LDCBlock is applied with stride = 1, except that the last one in each stage uses stride = 2. The sSE module is also applied to refresh feature maps in the LDCBlocks, as listed in Table I. Finally, a GAP layer and a 1×1 convolution are adopted to convert the feature maps into a 1280-D feature vector for the final classification.
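The overall layout can be sketched as follows; the per-stage channel counts, strides, and expansion ratios are placeholders rather than the values in Table I, and a trivial separable-convolution block stands in for LDCBlock so that the skeleton runs on its own.

# Skeleton of the LDCNet layout: stem conv, stacked stages, GAP, 1x1 conv to 1280-D, classifier.
import torch
import torch.nn as nn


def _sep_conv(in_ch, out_ch, exp_ch, stride=1):
    # Trivial stand-in block; in practice an LDCBlock (with or without sSE) would be used.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1, groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch), nn.ReLU6(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch))


def build_ldcnet(block=_sep_conv, stage_channels=(24, 40, 80, 160), num_classes=1000):
    layers = [nn.Conv2d(3, 16, 3, stride=2, padding=1, bias=False),
              nn.BatchNorm2d(16), nn.ReLU6(inplace=True)]        # stem: 3x3 conv, 16 filters
    in_ch = 16
    for out_ch in stage_channels:
        layers.append(block(in_ch, out_ch, exp_ch=4 * in_ch, stride=1))    # stride-1 block(s)
        layers.append(block(out_ch, out_ch, exp_ch=4 * out_ch, stride=2))  # last block downsamples
        in_ch = out_ch
    return nn.Sequential(
        *layers,
        nn.AdaptiveAvgPool2d(1),
        nn.Conv2d(in_ch, 1280, 1), nn.ReLU6(inplace=True),       # 1x1 conv -> 1280-D feature
        nn.Flatten(),
        nn.Linear(1280, num_classes))


if __name__ == "__main__":
    net = build_ldcnet()
    print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])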

E. CenterNet-Tiny
We improve CenterNet [46] with the proposed LDCNet, FPN [52], and Dilated Encoder [22], and then obtain CenterNet-Tiny, as shown in Fig. 4. We add two 3×3 convolutions after the output feature maps of the path aggregation network for the keypoint prediction and regression branches. Additionally, the channels of the feature maps are compressed from 256 to 96. The training loss function is defined as

L = (1/N_pos) Σ_xyc L_cls(P̂_xyc, P_xyc) + (λ/N_pos) Σ_xy 1{P_xyc > 0} L_reg(d̂_xy, d_xy)

where L_cls denotes the focal loss [53] and L_reg denotes the GIoU loss [54]. N_pos is the number of keypoints in the image. The value of λ is set to 2 and balances L_cls and L_reg. P̂_xyc is the predicted keypoint, and P_xyc is the labeled keypoint for classification; the subscript xyc ranges over all coordinate points on all heatmaps (c represents the target category, and each category has one heatmap). As the regression target, the 4-D real vector d_xy = (l, t, r, b) is utilized to obtain the location, where l, t, r, and b are the distances from the location to the four sides of the bounding box, and d̂_xy is the regression prediction for each keypoint. 1{P_xyc > 0} is the indicator function, which is 1 if P_xyc > 0 and 0 otherwise. At inference time, the predicted keypoint values P̂_(x_i y_i c) serve as detection confidences, and the regression predictions l, t, r, and b at each keypoint location (x_i, y_i) produce a bounding box.
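The following sketch shows how such a loss can be composed, assuming the focal-loss values per heatmap location and the decoded boxes are already available; it uses torchvision's generalized_box_iou as a stand-in for the GIoU term and does not reproduce the exact heatmap focal loss or target encoding of CenterNet*.

# Illustrative composition of the classification and regression terms with lambda = 2.
import torch
from torchvision.ops import generalized_box_iou


def detection_loss(cls_loss_per_point, pred_boxes, gt_boxes, pos_mask, lam=2.0):
    """cls_loss_per_point: precomputed focal-loss values over all heatmap locations.
    pred_boxes / gt_boxes: (N, 4) boxes in (x1, y1, x2, y2) decoded from the (l, t, r, b)
    targets at the N positive keypoint locations. pos_mask: boolean mask of positives
    (the indicator 1{P_xyc > 0})."""
    n_pos = pos_mask.sum().clamp(min=1).float()
    loss_cls = cls_loss_per_point.sum() / n_pos
    giou = torch.diag(generalized_box_iou(pred_boxes, gt_boxes))  # matched pairs on the diagonal
    loss_reg = (1.0 - giou).sum() / n_pos
    return loss_cls + lam * loss_reg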

IV. EXPERIMENTS
To compare parameters, FLOPs, and classification accuracy, we evaluate the performance of LDCNet on the CIFAR data sets and the ImageNet data set against MobileNets, IGCV networks, and other state-of-the-art compact models. On the NEU-DET data set, the efficiency of LDCNet as a backbone network is validated for the defect detection task.

A. LDC-CondenseNet Architecture Configurations
As shown in Table II, the following network configurations are utilized in the experiments on the CIFAR data sets. The network architecture adopts growth rates k ∈ {8, 16, 32} to ensure that the growth rate is divisible by the number of groups. As shown in Figs. 5 and 6, LGC [21] is applied to the first 1×1 convolutional layer of each basic layer with a condensation factor of C = 4. During the training process, 75% of the filter weights are gradually pruned in steps of 25%.

B. Analysis on Hyperparameters
As shown in Fig. 1 and Section IV-A, an essential factor named the scale s is introduced into the LDCNet strategy, and the LDCNet with scale dimension s is denoted as LDCNet-s (e.g., LDCNet-s2 for s = 2). Although the given architecture in Table I can maintain accuracy with few parameters, smaller models or higher accuracy may be required for specific tasks in different scenes. Referring to [30] and [34], a factor γ is designed to adjust the number of channels, which changes the width of the network model. This factor γ is called the width multiplier, and we denote the LDCNet with width multiplier γ as LDCNet γ×. A smaller γ leads to fewer parameters and lower performance, and vice versa.
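A small helper like the one below illustrates how γ can be applied: each channel count in a hypothetical base configuration is scaled and rounded back to a multiple of s so that the LDC split stays valid. The rounding rule and the base channel list are assumptions for illustration.

# Width-multiplier helper (illustrative rounding rule, not the paper's exact recipe).
def scale_channels(base_channels, gamma, s=2):
    return [max(s, int(round(c * gamma / s)) * s) for c in base_channels]


# e.g., a 0.5x model with s = 2 on a hypothetical base stage configuration
print(scale_channels([16, 24, 40, 80, 160], gamma=0.5, s=2))  # [8, 12, 20, 40, 80]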

C. Data Sets and Training Settings
CIFAR Data Sets: The CIFAR data sets [55], CIFAR-10 and CIFAR-100, contain 60 000 RGB images of size 32 × 32 pixels, with 50 000 images for training and 10 000 images for testing.
The standard data augmentation methods [4], [11], [32], [33], [56] are adopted in this article. The images are padded with four zero-pixels on each side and then randomly cropped back to 32 × 32 pixels, followed by horizontal mirroring and normalization. During training, stochastic gradient descent (SGD) is adopted to train the CNNs from scratch for 300 epochs. In each iteration, 64 images are trained on one GPU (RTX 2080 Ti). To avoid overfitting of the proposed models, a dropout rate of 0.2 is applied before the last FC layer [57].
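For reference, the CIFAR pipeline described above can be expressed as follows; the normalization statistics, the learning-rate schedule, and the LDCNet constructor in the commented lines are assumptions, not values taken from this article.

# CIFAR-10 data pipeline matching the described augmentation (pad 4, crop 32, flip, normalize).
import torch
import torchvision
import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomCrop(32, padding=4),          # pad each side with 4 zero-pixels, crop back to 32x32
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),  # common CIFAR-10 statistics
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                          transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

# model = LDCNet(...)                     # hypothetical constructor for the proposed network
# optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)  # assumed schedule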
ImageNet Data Set: We also evaluate the proposed LDCNet on the ImageNet ILSVRC 2012 image classification data set [7], which contains 1000 visual classes, 1.2 million training images in total, and 50 000 validation images. We follow the default training settings in [29], except that the initial learning rate is 0.4 and the batch size is 512 on four GPUs. The results are reported as top-1 and top-5 performance on the ImageNet validation set.
NEU-DET Data Set: NEU-DET [58] is a surface defect detection data set, which includes six types of defects: crazing, inclusion, patches, pitted surface, rolled-in scale, and scratches. Each class has 300 images. We divide the NEU-DET data set into a training set and a test set, which contain 1260 images and 540 images, respectively. Images are resized to 640 × 640 pixels. The models are trained from scratch for 27k iterations. The initial learning rate is 0.01 and is decayed by a factor of 0.1 at 18k and 24k iterations. The batch size in each iteration is 16 images. The other default training settings are the same as those of CenterNet* [23].
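The iteration-based schedule can be written with a standard PyTorch MultiStepLR stepped once per iteration, as sketched below; the momentum and weight-decay values are assumptions, and the model and data pipeline are omitted.

# Iteration-level schedule: lr 0.01, x0.1 at 18k and 24k, 27k iterations in total.
import torch

params = [torch.nn.Parameter(torch.zeros(1))]          # placeholder parameters
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[18000, 24000], gamma=0.1)

for it in range(27000):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    scheduler.step()                                    # step per iteration, not per epoch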
We compare our LDC-CondenseNets with models obtained by state-of-the-art pruning techniques [13], [14], [16], [59]. In terms of FLOPs, LDC-CondenseNet-86-s2 requires about one third of the FLOPs of VGG-16 or ResNet-110 pruned by the method in [13], and it uses about half the parameters or FLOPs to achieve performance comparable to that of the most competitive baseline, ResNet-164-B in [16]. In addition, the LDC-CondenseNets also perform better than the knowledge distillation method ShuffleNetV2-EKD [20] on CIFAR-100.

F. Ablation Study
For comparison, LDCNet-s2 and other customized LDCNet architectures with different scales are constructed. For scales s = 3 and 4, the results presented in Table IV demonstrate that the test errors tend to decrease as the scale dimension s increases, even though the parameters and FLOPs are reduced.
Correspondingly, when we increase the width w of each scale, the classification error rate of LDCNet-s5, which has fewer parameters and FLOPs, is lower than those of LDCNet-s2 and LDCNet-s3. Although LDCNet-s2 1.2× has more parameters and FLOPs, its classification accuracy is lower than that of LDCNet-s5. We find that increasing the scale is more effective than increasing the width of the entire network, which proves that the performance gain yielded by our architecture is not merely due to an increased number of parameters. This result is in line with our analysis in Section III, which shows that the LDC module gives the training algorithm more flexibility to generate multiscale feature maps while keeping redundancy low. We also visualize the feature maps of our LDC module, as shown in Fig. 10. Although some feature maps generated by the LDC module are still somewhat similar, they show significant differences overall, which means that the generated features are abundant enough for the recognition tasks.

G. Results on ImageNet
On the ImageNet classification task, several efficient network architectures are chosen as competitors. The results are summarized in Table V. Considering FLOPs and parameters, the experimental results show that LDCNet-s2 outperforms MobileNetV1 1.0× and MobileNetV2 1.0× with about a 1.0% gain in accuracy, which demonstrates the effect of our LDC module. The classification accuracy of LDCNet-s2 is slightly lower than that of GhostNet 1.0×. The proposed LDCNet can also serve as a basic CNN architecture whose LDC module scale s and width w are hyperparameters of the search space for neural architecture search (NAS) [28]. In addition, more compact and higher-precision models could be created by applying platform-aware NAS [70].

H. CenterNet-Tiny From Scratch
We validate the efficiency of LDCNet as a backbone for an anchor-free object detector.As shown in Table VI,

V. CONCLUSION
In this article, we present the LDC module, which constructs hierarchical dense connections within a single layer and introduces cheap convolutional operations to generate multiscale feature maps, thereby reducing the computational costs of existing CNNs. We also present a sparse SE module that selectively refreshes the features containing useful information through global information. Furthermore, we design a lightweight neural network architecture named LDCNet by adopting the proposed LDCBlock, which has fewer parameters and FLOPs while maintaining comparable performance.
Compared with other efficient network architectures, such as FE-Net, the IGC networks, and MobileNets, the experimental results show the better performance of LDCNet and demonstrate that LDCNet achieves a better balance among FLOPs, parameters, and classification accuracy. The proposed methods are evaluated on image classification and defect detection tasks; in the future, they will be deployed on embedded hardware platforms and applied to other tasks.

Fig. 1 .
Fig. 1. Detailed view of the LDC module with channel width w = 2 and the number of scales s = 4. "Split" means equally splitting in the channel dimension. "c" and "concat" mean concatenating features in the channel dimension. DWConv represents depthwise convolution, and 1×1 Conv represents a 1×1 convolution. This module can replace a convolutional layer to improve existing neural network structures. ReLU6 is a variant of the ReLU activation function that caps the activation at a maximum value of 6.

Fig. 4 .
Fig. 4. Network architecture of CenterNet-Tiny. F3-F5 denote the feature maps of the backbone network. P3-P7 are the different feature levels used by the detection head. ↑ and ↓ denote upsampling and downsampling, respectively. H × W is the height and width of the feature maps. "stride" (stride = 8, 16, 32) is the downsampling ratio of the feature maps with respect to the input image.

Fig. 6 .
Fig. 6. Illustration of learned group convolutions with G = 2 groups and a condensation factor of C = 2 [21]. A fraction 1/C of the connections is removed after each of the C − 1 condensing stages during training, so a total fraction of (C − 1)/C is pruned; as shown in this figure, 50% of the filter weights are pruned during training. The index layer rearranges the feature maps during testing so that standard group convolutions can be used.

Fig. 10 .
Fig. 10. Left: visualization of some feature maps generated by the first 3 × 3 convolution, where three pairs of similar feature maps are annotated with boxes of the same color. Right: the feature maps obtained after the transformation of the LDC module with s = 4, corresponding to Fig. 1.

TABLE I
DETAILED ARCHITECTURE OF LDCNET. #EXP DENOTES THE EXPANSION SIZE. #OUT DENOTES THE NUMBER OF OUTPUT CHANNELS. #SSE INDICATES WHETHER THE SSE MODULE IS USED

TABLE II
LDC-CONDENSENET ARCHITECTURES FOR THE CIFAR DATA SETS

TABLE III
COMPARISONS OF CLASSIFICATION ERROR RATE (%) ON CIFAR-10 (C10) AND CIFAR-100 (C100) WITH STATE-OF-THE-ART FILTER-LEVEL WEIGHT PRUNING OR KNOWLEDGE DISTILLATION METHODS. "-" MEANS NO REPORTED RESULTS ARE AVAILABLE

The LDCNets also have competitive performance compared with other mainstream lightweight networks. For example, the classification error rate of LDCNet-s2 0.5× with 1.19M parameters is 22.82%, while WRN-32-4 reaches 23.55% with 7.4M parameters and ANTNet (g = 2) reaches 24.30% with 2.2M parameters.

TABLE V
COMPARISON OF STATE-OF-THE-ART SMALL NETWORKS IN TERMS OF CLASSIFICATION ACCURACY, NUMBER OF PARAMETERS, AND FLOPS ON THE IMAGENET DATA SET