Real-time and Embedded Compact Deep Neural Networks for Seagrass Monitoring

We propose an efficient and robust segmentation network for automated seagrass region detection. The proposed network has a simple architecture that reduces both computational demands and inference energy cost. More importantly, the scale of the network can be easily adjusted to balance computational demands against segmentation accuracy. Experimental results show that our proposed network robustly segments the various seagrass patterns with 90.66% mIoU (mean Intersection over Union) accuracy. It achieved 200 frames per second (FPS, 1.42 times faster than the second-best network, GCN) on a desktop GPU, and 18 FPS on an NVIDIA Jetson TX2. It has only 3.45M parameters and 0.587 GMACs of FLOPs (FLoating point OPerations), just 14.6% and 10.8% of those in GCN, respectively. To segment a single image on the Jetson TX2, our architecture requires an average energy of 0.26 Joule, only 46% of that of DeepLab, which shows the proposed network to be energy efficient. The proposed network demonstrates accurate, real-time segmentation capability and can be deployed on low-energy embedded AUVs for sea habitat protection.


I. INTRODUCTION
Seagrasses are considered among the most ecologically important coastal habitats, supporting many marine species, regulating water quality, and providing long-term carbon sequestration. Effective management of seagrass meadows is also fundamental for protecting high-value carbon pools and for climate change mitigation. For this purpose, AUVs (Autonomous Underwater Vehicles) can be used to monitor the health status of seagrass meadows on a regular basis without extra human assistance.
However, on-board analysis poses many challenges. On the vision side, water visibility conditions vary over time, geographical location and even depth. The seagrass meadows themselves change colour and texture over the year because of their natural life cycle. During a mission, the camera pose and distance with respect to the benthic habitats vary as well. To accurately recognize and precisely segment seagrass, the AUV's vision methods must be robustly designed.

On the hardware side, AUVs rely on embedded electronics and batteries. They therefore have very limited computing capability, in terms of memory and FLOPs, and limited energy autonomy. The AUV's intelligence thus also needs to be optimized to satisfy such strict hardware constraints. To address the above challenges, we propose a novel segmentation network that is accurate, compact and energy efficient. Its encoder-decoder structure is shown in Fig. 1. We observed that the network encoder usually requires the most FLOPs and parameters (see Table II). So, to reduce the network's total FLOPs, we adopted the light-weight backbone MobileNet [16] as encoder. To further reduce the computational load, we then pursued a light decoder with as few convolution layers as possible.
Our main contributions are as follows: 1) We propose a segmentation network with a compact structure of only 3.45M parameters and 0.587G FLOPs (with 256x256 input resolution). Compared with state-of-the-art networks such as SegNet, it reduces FLOPs by 99%. The size of the proposed network can be controlled by simply setting hyper-parameters, enabling further compression.
2) We show that our network still achieves state-of-the-art accuracy with 90.66% mIoU on a public seagrass dataset. The achieved accuracy is significantly better than the baseline seagrass segmentation result of 80.9% reported in [1].
3) The proposed network, trained on seagrass images taken at 0~2m seafloor distance, is able to infer on images taken from 2~6m with 77.73% mIoU. The proposed network is thus robust to changes in seafloor distance and visibility conditions. 4) We demonstrate that, given an input image resolution of 256x256 pixels, the proposed segmentation network achieves real-time seagrass segmentation on an embedded platform (Jetson TX2) at 18 FPS with just 0.26 Joule of energy consumption per inference. We thereby show that such a network can be deployed on AUV embedded devices with reduced energy needs.

II. RELATED WORK
With a global coverage of around 3.45×10^5 km² [4,12], growing along most coastlines of every continent except Antarctica [6], seagrass meadows are one of the most effective carbon sinks on Earth. However, direct or indirect human disturbances, including those from global anthropogenic changes, impact seagrass meadow ecosystems [6,12] and threaten their capacity as key carbon capture agents. There is a clear need to frequently monitor seagrass ecology to assess possible meadow damage and its development over time, in order to better plan, execute and validate conservation efforts. This can be achieved by taking underwater images of the benthic habitat and by measuring the seagrass meadow coverage and distribution to estimate key seascape multi-metric indices, such as the conservation index CI [8] and the seagrass habitat structure index [9].
Traditional machine learning methods for seagrass detection, with reduced or no human interaction, as proposed in [1,2,10-13], can achieve accurate results. To start with, these methods separate a full-resolution seagrass image into small patches or super-pixel regions. Then, from those, they extract feature maps with various feature extraction and image processing methods: Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP) or, more recently, deep Convolutional Neural Networks (CNN). Finally, using different kinds of machine learning classifiers, such as Support Vector Machines (SVM) or Logistic Model Trees (LMT), those patches and regions are classified and then mapped to a pixel-wise classification (i.e., a segmentation).
However, it is still difficult to deploy these methods on embedded electronics for real-time analysis on AUVs. To execute their multiple and complex image processing steps, they in fact need many resources, such as memory and FLOPs. On embedded boards, if they run at all, these traditional methods thus take a long time to extract image features and infer the result. There is therefore a need for light, efficient and accurate seagrass segmentation algorithms that can be used on embedded platforms.
End-to-end deep learning (DL) networks can address the above segmentation problems. By easily leveraging GPU resources, they can perform faster end-to-end analysis than traditional methods. They have also been demonstrated on AUVs [7,11,13,14]. Yet such proposed DL methods need a very complex architecture, with five DL models to be inferred at the same time and then averaged. Such complexity poses a computational burden on real-time inference performance. So, even if such DL networks work well on desktop servers with powerful GPUs, it is challenging to run them on the limited computational resources of AUV embedded systems.

III. APPROACH
As shown in Fig. 1, the overall architecture of our proposed segmentation network consists of an encoder network followed by a decoder network. They are linked together through skip-connections. The encoder compresses the input image into multi-scale feature maps by down-sampling; the decoder reconstructs a segmentation at the same resolution by up-sampling. Skip-connections forward additional spatial information and low-level features from the feature maps at different encoder layers to the corresponding decoder layers. Below we look into these three components in detail, considering the computational constraints posed by embedded systems.

A. Efficient Encoder
The objective is to determine a backbone network to serve as the encoder. It must keep the network's parameter count and overall FLOPs as small as possible, while still being effective at extracting feature maps. To further address the trade-off between computation cost and accuracy on embedded devices, the encoder scale should also be adjustable. This allows the network to be flexible and meet the specific energy and computation cost requirements of a chosen embedded platform.
Inspired by [19-23] and as shown in Fig. 1, we employ the basic MobileNet [16] without its classification layers as encoder. The encoder has nine convolution layers and blocks and down-samples the input image to 1/32-resolution feature maps. According to the theoretical analysis of [16], it can significantly reduce the convolution computing demands by utilizing Inverted Residual Convolutions in the down-sampling blocks. MobileNet is capable of maintaining a high segmentation accuracy with significantly fewer network parameters and FLOPs. In fact, assuming 256x256 input resolution, it has only 9.4% of the parameters of ResNet-50 and 15.1% of those of VGG-16, and 7.5% of the FLOPs of ResNet-50 and 2% of those of VGG-16. In our experiments, as shown in Table II, the encoder of our architecture takes the largest proportion of parameters (64.3%) and FLOPs (69.5%). Therefore, utilizing an encoder with a small number of parameters and low computational demands yields a significant overall saving of RAM and CPU/GPU resources. Moreover, the encoder scale can be flexibly reconfigured by simply modifying the hyper-parameter α. We set α = 1.0 throughout the experiments.
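The saving that motivates this backbone choice can be illustrated with a back-of-the-envelope MAC count: a depthwise-separable convolution (the core of the inverted residual block) replaces one dense k×k convolution with a depthwise k×k plus a 1×1 pointwise convolution. The layer sizes below are hypothetical illustrations, not taken from the paper.

```python
# Illustrative multiply-accumulate (MAC) counts for one layer; the
# feature-map and channel sizes are assumed for this example only.

def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard dense k x k convolution."""
    return h * w * c_in * c_out * k * k

def dw_separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k conv followed by a 1x1 pointwise conv."""
    return h * w * c_in * k * k + h * w * c_in * c_out

h = w = 32; c_in = c_out = 128; k = 3
dense = conv_macs(h, w, c_in, c_out, k)
separable = dw_separable_macs(h, w, c_in, c_out, k)
print(f"dense: {dense/1e6:.1f} MMACs, separable: {separable/1e6:.1f} MMACs")
# → dense: 151.0 MMACs, separable: 18.0 MMACs (about an 8.4x reduction)
```

The reduction factor approaches 1/(1/c_out + 1/k²), which is why the savings grow with the channel count.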

B. Compact Decoder
We propose a new compact decoder structure for use on embedded devices. The decoder learns to recover resolution from the down-sampled feature maps and eventually predicts the pixel-wise seagrass segmentation of the input image at the same input resolution. As represented by the brown layers in Fig. 1, the decoder comprises five up-sampling blocks. More specifically, as shown in Fig. 2, each block has the following structure: first, one convolution layer to reduce the feature channels, followed by Batch Norm and ReLU6 layers, then 2x bi-linear interpolation, and finally a residual convolution layer.
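A minimal PyTorch sketch of one such up-sampling block follows. The channel sizes and the exact placement of Batch Norm in the residual branch are our assumptions; only the block ordering (1×1 channel reduction → BN → ReLU6 → 2× bilinear up-sampling → residual convolution) is taken from the paper.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """Sketch of one decoder up-sampling block (assumed details):
    1x1 conv to shrink channels -> BN -> ReLU6 ->
    2x bilinear up-sampling -> residual 3x3 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU6(inplace=True),
        )
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.res_conv = nn.Sequential(
            nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        x = self.up(self.reduce(x))
        return x + self.res_conv(x)  # residual connection around the conv

# Toy usage: channel counts are illustrative, not the paper's values.
block = UpBlock(c_in=1280, c_out=96)
y = block(torch.randn(1, 1280, 8, 8))
print(y.shape)  # torch.Size([1, 96, 16, 16])
```

The 1×1 reduction running before the up-sampling is what keeps the interpolation and residual convolution cheap: both then operate on the reduced channel count.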
The motivation for using this simple symmetrical encoder-decoder architecture is to avoid complex structures requiring extra computing resources [19,21]. Compared to Pyramid Scene Parsing (PSP) [22] and Atrous Spatial Pyramid Pooling (ASPP) [23], we developed a compact structure with only a few channels to reduce the FLOPs and parameter count. On the other hand, employing 1x1 convolution layers at the beginning of up-sampling to reduce the feature map channels (from both the previous layer and the skip-connections) compresses the decoder and further saves FLOPs. The decoder FLOPs account for only 30.5% of the whole network, while the decoder parameters take 35.7%, as shown in Table II in Section IV. In comparison, the FLOPs proportion of the decoder in DeepLab [23] with ASPP is 82%. Utilizing this simple structure, we can compress the network further without losing accuracy. It also helps to deploy the architecture on embedded devices and achieve the goal of accurate, real-time analysis. As with the encoder, the number of channels in each decoder layer can be rescaled to set the network size. Thus, we use a hyper-parameter β to scale the number of channels in each decoder layer; β is experimentally set to 1.0. The base channel number of each decoder block is shown in Fig. 1.

C. Skip-connections
Skip-connections, shown as red arrows in Fig. 1, directly transmit low-level features extracted by the encoder to their corresponding decoder blocks. They help to reconstruct higher-resolution predictions from the feature maps and mitigate the spatial information loss caused by up-sampling. We also implemented and tested refinement modules in the skip-connections, such as those of [24]. However, experimental results showed that the simplest forward structure outperforms such refinements, with greatly reduced computational demands.
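The "simplest forward structure" can be sketched as a plain channel-wise concatenation of encoder and decoder feature maps, with no refinement module in between. The fusion by concatenation and the shapes below are our assumptions for illustration.

```python
import torch

# Assumed skip-connection fusion: the low-level encoder map is
# concatenated with the up-sampled decoder map along the channel axis,
# before the decoder block's 1x1 channel-reduction convolution.
enc_feat = torch.randn(1, 32, 64, 64)   # low-level features from encoder
dec_feat = torch.randn(1, 48, 64, 64)   # up-sampled decoder features
fused = torch.cat([dec_feat, enc_feat], dim=1)
print(fused.shape)  # torch.Size([1, 80, 64, 64])
```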

IV. EXPERIMENTS AND RESULTS
In this section, we compare the proposed network's performance with four state-of-the-art segmentation architectures: U-Net [20], SegNet [19], GCN [24] and DeepLab [23]. All the networks were trained on an NVIDIA RTX 2080Ti GPU. Section A presents the large public seagrass database used for the experiments. Section B explains the parameters and details of network training. Section C compares the accuracy, speed and storage cost of the investigated models. Moreover, we show that the proposed model generalizes well to new data with a larger distance to the seabed. Section D investigates the impact of the network hyper-parameters (α and β) on the accuracy and computational demands of the proposed network. Section E shows the performance and energy consumption of the investigated networks on the embedded device.

A. The Seagrass Dataset
The public seagrass dataset [1] comprises 12682 seagrass images at 1920x1080 pixel resolution, taken at different distances from the seabed. 6037 of them were manually labelled by experts with binary masks for the seagrass and non-seagrass classes. Fig. 3 shows sample images from the dataset and their corresponding ground truths for the different depth groups. As in the benchmark work by Reus et al. [1], we split the 6037 labelled images into three groups: a training set of 4223 images, a test set of 1204 images and a validation set of 610 images. We used the two evaluation metrics proposed in [15]: the F-Score, which measures twice the overlap between prediction and ground truth divided by their total number of pixels; and the mean Intersection over Union (mIoU), which evaluates the overlap between predictions and benchmark masks as a proportion of their union.
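For concreteness, a simplified sketch of the two metrics on flat binary pixel lists follows; the exact averaging convention used in [15] may differ from this assumed formulation.

```python
# Assumed formulation of the two evaluation metrics for binary masks,
# computed on flat 0/1 pixel lists for readability.

def iou(pred, gt, cls):
    """Intersection over union for one class label."""
    inter = sum(1 for p, g in zip(pred, gt) if p == cls and g == cls)
    union = sum(1 for p, g in zip(pred, gt) if p == cls or g == cls)
    return inter / union if union else 1.0

def mean_iou(pred, gt):
    """mIoU: per-class IoU averaged over seagrass (1) and background (0)."""
    return (iou(pred, gt, 0) + iou(pred, gt, 1)) / 2

def f_score(pred, gt):
    """F-Score (Dice): twice the overlap over total positive pixels."""
    inter = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    total = sum(pred) + sum(gt)
    return 2 * inter / total if total else 1.0

pred = [1, 1, 0, 0, 1, 0]   # toy prediction
gt   = [1, 0, 0, 0, 1, 1]   # toy ground truth
print(f"mIoU = {mean_iou(pred, gt):.3f}, F = {f_score(pred, gt):.3f}")
# → mIoU = 0.500, F = 0.667
```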

B. Network Training and Implementation Details
We applied data augmentation to increase data diversity, since it makes the network more robust to varying seagrass data patterns. In particular, we performed random image colour jitter, random rotation and random horizontal flips on each training image. Such data augmentation doubled the number of training images. The network takes input images at 256×256 resolution.
We use the ADAM optimizer with an initial learning rate of 0.001 and a batch size of 64. The number of training epochs is set to 100 and the other training parameters are left at their defaults. The cross-entropy loss function used to train the network is defined in equation (3):

L = -(1/N) Σ_{i=1}^{N} [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]    (3)

where y_i is the ground truth class label (1 for seagrass and 0 for non-seagrass) of pixel i, ŷ_i is the predicted class probability of pixel i, and N is the total number of pixels.
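The training objective and optimizer settings can be sketched in PyTorch as follows. The one-layer `model` is a placeholder for the proposed network (not its real architecture), and the numerically stable logits form of the binary cross-entropy is used in place of a hand-written equation.

```python
import torch
import torch.nn as nn

# Placeholder model: a single 1x1 conv standing in for the real network.
model = nn.Conv2d(3, 1, kernel_size=1)

# Settings from the paper: Adam optimizer with initial lr 0.001.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Binary cross-entropy over all pixels, i.e. equation (3) in its
# numerically stable logits form.
criterion = nn.BCEWithLogitsLoss()

imgs = torch.randn(2, 3, 64, 64)                     # toy image batch
masks = torch.randint(0, 2, (2, 1, 64, 64)).float()  # toy binary masks
loss = criterion(model(imgs), masks)
loss.backward()
optimizer.step()
print(float(loss) > 0)  # True
```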
As in [1], we trained the networks and tested them on the 0m~2m and 0m~6m datasets separately. Table I shows their accuracy comparison in terms of mIoU and F-Score, while Fig. 4 presents the predicted seagrass segmentation results. As previously discussed, the 0m~2m trained models were also tested on the 2m~6m dataset, to test whether they were robust enough to give correct predictions at seafloor distances larger than 2 meters as well.
We found that all the investigated models achieved a high accuracy level of around 91%~93% mIoU and 95%~97% F-Score on 0~2m images, and 89%~91% mIoU and 94%~96% F-Score on 0~6m images. This shows that seagrass segmentation becomes increasingly difficult at larger seabed distances, since at large distances the seagrass images are significantly different and less clear. Our fine-tuned network slightly outperformed the other four models when trained and applied to the same depth group of 0m~6m images, with 89.71% mIoU. At the same time, our model achieved the best robustness across imaging depths: the model trained on 0m~2m images achieved the highest mIoU (77%) and F-Score (87%) on 2m~6m images captured in low visibility and resolution conditions. Moreover, among all the networks, ours has the smallest number of parameters (14.6% of GCN) and the lowest computational demand (just 10.8% of the FLOPs of GCN). Testing them all on a desktop PC with a GeForce RTX 2080Ti GPU (with 256x256 image resolution), the proposed network also showed the fastest inference speed at 200 FPS (1.42 times faster than GCN).
Hence, while still achieving accuracy competitive with the state of the art, the proposed segmentation network showed significant advantages in speed, on-board storage and computational cost.

D. Network Configuration Parameters and Their Impact on Network Performance
For a real-time embedded system, the trade-off between network performance (accuracy and inference speed) and computational demands (memory, energy cost and FLOPs on the embedded device) must be balanced carefully. In this section, we therefore discuss the effects of the hyper-parameters α and β on network accuracy and computational demand (FLOPs). The hyper-parameters α and β are linked with the encoder and decoder sections of the network, respectively. By making them smaller, we decrease the number of channels in each layer and, as a result, compress the network and reduce its FLOPs. By adjusting α and β, we can thus address the trade-off between network accuracy and FLOPs. On the one hand, such compression can negatively affect network accuracy; on the other hand, reducing the required FLOPs helps the network's deployment on embedded devices.
To study how network accuracy and FLOPs depend on α and β, we built various networks by modifying one parameter while keeping the other fixed, and measured their mIoU accuracy. As shown in Fig. 5, when we expand either the encoder, by varying α from 0.5 to 1.2 with β = 1, or the decoder, by varying β from 0.5 to 1.2 with α = 1, the mIoU increases. Conversely, by decreasing these hyper-parameters we can greatly reduce the required FLOPs without losing much accuracy. For example, by setting α to 0.6 and β to 1.0, the required FLOPs are reduced by 64.3% with only a 1.68% accuracy loss with respect to higher hyper-parameter values. For better accuracy, we can instead set α to 1 and β to 1.2, reaching 90.66% mIoU at 0.634 GMACs of FLOPs.
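The width hyper-parameters act by multiplying each layer's base channel count. A minimal sketch follows; the round-to-a-multiple-of-8 convention is borrowed from the MobileNet reference implementation and is an assumption here, not stated in the paper.

```python
# Sketch of channel scaling by a width multiplier (alpha for encoder
# layers, beta for decoder layers). The divisor-of-8 rounding is an
# assumed convention from the MobileNet reference code.

def scale_channels(base, mult, divisor=8):
    """Scale a base channel count by `mult`, rounded to `divisor`."""
    c = max(divisor, int(base * mult + divisor / 2) // divisor * divisor)
    if c < 0.9 * base * mult:  # never round down by more than 10%
        c += divisor
    return c

# Hypothetical base channel counts for a few layers:
for alpha in (0.5, 0.6, 1.0, 1.2):
    print(alpha, [scale_channels(c, alpha) for c in (32, 96, 320)])
```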

E. Speed and Energy Consumption on TX2 Embedded Device
We tested the speed and energy cost of these networks on an AUV embedded platform: an NVIDIA Jetson TX2 with a 256-core NVIDIA Pascal™ GPU and 8GB of RAM on an Orbitty carrier board. Both Bluetooth and Wi-Fi were disabled and the monitor was disconnected; only the Ethernet link was active, to access the TX2. We ran all networks on the on-board TX2 GPU with a 256x256 image as input. We metered the TX2 energy consumption with two Fluke-115 multimeters, measuring the current and voltage in series and in parallel with the TX2 power input. While each network was running, we measured the TX2 power consumption every 250 milliseconds for 100 samples and averaged the results. From this average, we subtracted the idle power of the TX2, which was measured to be 1.98 W. The result was then divided by the measured FPS on the TX2 GPU, obtaining EINF, the energy necessary to infer one input image. Among all the networks, the proposed network has the lowest energy consumption per image, with an EINF of 0.26 Joule, only 46% of that of the second-best, DeepLab. As indicated by its low standard deviation, it also has the lowest energy fluctuations. From the energy efficiency point of view, the proposed network is thus the most suitable for deployment on embedded boards for AUV use.
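The per-image energy computation described above can be sketched as follows; the power readings in the example are hypothetical, chosen only to illustrate the arithmetic with the paper's idle power (1.98 W) and TX2 throughput (18 FPS).

```python
# E_INF: average sampled power minus idle power, divided by throughput.

def energy_per_inference(power_samples_w, idle_w, fps):
    """Energy in Joules needed to infer one image."""
    avg_power = sum(power_samples_w) / len(power_samples_w)
    return (avg_power - idle_w) / fps

samples = [6.7, 6.6, 6.8, 6.7]   # hypothetical TX2 power readings (W)
e_inf = energy_per_inference(samples, idle_w=1.98, fps=18)
print(f"E_INF = {e_inf:.2f} J")  # → E_INF = 0.26 J
```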

V. CONCLUSION
In this study, we proposed a deep learning network for seagrass segmentation that is accurate, fast, tunable, compact and energy-efficient on embedded electronics. It reaches an mIoU of 90.66% when trained on images with 0-6m camera-to-seafloor distance, and requires only 0.634 GMACs of FLOPs, the smallest value among all the investigated networks. In terms of inference speed, it runs at 196 FPS on a desktop NVIDIA 2080Ti and at 18 FPS on an embedded Jetson TX2. At the same time, the network is very energy efficient, requiring only 0.26 J per inference frame on the Jetson TX2, just 46% of the energy of the second-best network, DeepLab. The demonstrated accuracy, speed and efficiency of the proposed segmentation architecture hence allow it to be deployed for real-time use on embedded electronics. It will enable end-to-end segmentation tasks on embedded AUVs for live measurements at sea, and will thus be a step towards the autonomous and effective survey and management of seagrass meadows and the preservation of their carbon capture and ecological value.