Multilabel Classification of Heterogeneous Underwater Soundscapes With Bayesian Deep Learning

Underwater soundscapes of coastal zones close to human settlements are heterogeneous in nature. Multiple ships and biological sources are often simultaneously present in the passive sonar vicinity. The classification of such heterogeneous underwater soundscapes is a challenging task for humans as well as machine learning systems. In this article, a Bayesian deep learning approach is proposed that can accurately classify multiple ships simultaneously present near the sensor (multilabel classification) and provide uncertainty in the classification. This is achieved by adopting a Bayesian formulation of standard convolutional neural network architectures to not only assign multiple labels per inference but also to provide per-inference uncertainty. By utilizing almost 3500 h of passive sonar data (spanning more than a year of sensor deployment) labeled through automated fusion with automatic identification system information, both multiclass and multilabel classification tasks of ship-generated noise are addressed. The best performing Bayesian architecture on the multilabel task achieves a weighted $F_1$ score of 0.84, where each prediction is accompanied by a measurement of uncertainty, which is used to further enhance the understanding of model predictions.

The classification of underwater soundscapes is of interest to several communities, including biologists and oceanographers who look to study fish and whale populations through recordings from the underwater environment [1]-[3]. Shipping noise can have adverse impacts on marine mammal populations. The measurement and modeling of shipping noise is important in predicting environmental impacts. Such research aids autonomous monitoring of fisheries and fishery enforcement by government and environmental groups [4]. Ships, submarines, and unmanned underwater vehicles can use passive sonar classification systems to aid in the identification and tracking of contacts, to help maintain safety of navigation, to aid in the real-time interdiction of illicit activities (such as smuggling and covert vessel transits), and to provide port security [5], [6]. The use of machine learning algorithms for the classification of underwater sounds is well established [7]. Most research in this area, however, focuses on identification of biological sounds [1], [3], with considerably less reported research on man-made or ship sounds.
This lack of research is partly due to the fact that the datasets used in ship classification tasks are often limited, either in size or in similarity to real-world conditions. Zak used sounds recorded from just five naval vessels to demonstrate the use of self-organizing maps and neural networks to classify ship sounds with greater than 70% accuracy [8]. Santos-Domínguez et al. report using only two hours of recordings [9], and Niu et al. use just three ships with 30 minutes of recording from each ship [10]. Berg et al. [5] and Neilsen et al. [11] both use synthetically generated samples for training due to a lack of real-world data.
An overall lack of data also affects the quality of results by reducing the diversity of conditions in which ship noise is recorded. A ship on the ocean creates acoustic signals from operating machinery, propeller cavitation, and the motion of propeller shafts and reduction gears [12].
Vibrations of operating engines and pumps are transferred through the hull into the water, creating a distinctive pattern of sound that can be detected by a hydrophone. The size, speed, and aspect to the sensor all affect the type and strength of signals received [12], as do oceanographic conditions such as temperature, salinity, and pressure (primarily a function of depth) [13]. These conditions change regularly depending on factors such as weather, time of day, and time of year.
Arveson and Vendittis provide an overview of the sound sources and source levels that are generated by a bulk cargo ship [14]. McKenna et al. examined recordings of multiple commercial ships, which show that the sound from container ships predominately falls below 40 Hz and that all ships showed asymmetry in their signatures, with bow aspect radiated noise lower than stern aspect [13]. These studies illustrate some of the challenges of automatic classification of ships, including differences in emitted noise from the same ship due to changes in equipment use, variable water conditions that can change how emitted sound from the same ship is picked up by the receiver, and changes in ship aspect and/or range relative to the receiver. So far, underwater soundscape classification tasks have been treated as acoustic event classification, in which a sample contains a single acoustic event to be labeled with one out of a number of possible classes (this is known as multi-class classification) [1], [2]. This approach, however, is an inaccurate representation of the heterogeneous underwater acoustic environment, where multiple ship signals are often simultaneously present. Machine learning models trained on a multi-class classification task will provide a single label to the input data stream and will miss labeling any other ships present in the audio sample. The ability to demonstrate underwater soundscape classification on multiple, simultaneous ships using a single element hydrophone (measuring scalar pressure only) and provide an uncertainty measurement for those estimates has remained a challenge for the community at large. This paper addresses that challenge by using a multi-label classification model, which has the ability to assign one or more labels to an individual sample. In contrast to the common approach of rare acoustic event classification, here the goal is to detect multiple target labels per inference (per sample) of the neural network classifier. Similar to the Google YouTube8M challenge [15], this is achieved by developing and evaluating a multi-label convolutional neural network (CNN) architecture. To address the lack of uncertainty measurement in the classification estimates, a Bayesian Deep Learning (BDL) approach is adopted. BDL blends deep learning with Bayesian theory to enable models that provide uncertainty measurements and are more robust to overfitting relative to deterministic (classical) neural network architectures [16]. Uncertainty measurements allow for a deeper understanding of the model's predictions [17].
In this work, BDL model architectures are developed to not only establish the link between the ship acoustic signature and the classification ontology adopted, but also to estimate the uncertainty in the classification. Both deterministic and Bayesian configurations of deep Residual Network (ResNet) [18], [19] architectures and a custom CNN architecture are adopted and benchmarked on the task. The advantage of the Bayesian architecture over its deterministic counterpart is outlined, and uncertainty of inference of ship classification is presented as a unique improvement of Bayesian architectures over deterministic counterparts for underwater soundscape classification applications. Moreover, the large size of our dataset, with more than 4,000 unique ships and over 3,400 hours of labeled audio data, enabled us to study the impact of seasonal variations of sound speed profiles (SSPs) on the bias and quality of classifications of the developed deep learning models. Additionally, we study the quality of the measured uncertainty through a use-case study and correlate the uncertainty of classification to distance from the sensor and bow-stern orientation.

II. DATASET
The dataset used for model training and evaluation was recorded at Thirty Mile Bank off the coast of southern California from December 2012 to November 2013, totaling more than 6,800 hours of recordings with 4,470 unique ships recorded. The sensor, a High-frequency Acoustic Recording Package (HARP), was deployed in 734 meters of water with the sensor 51 m above the sea floor and an original sample rate of 200 kHz [20]. Recordings were downsampled to a 4 kHz sample rate for labeling and model training [12].
In parallel to the HARP deployment, we used automatic identification system (AIS) data to develop datasets for both multi-class and multi-label tasks [12]. Given that there is no formal ontology of ship sounds, in order to utilize the AIS stream this research expands upon the ship ontology described by Santos-Domínguez et al. [9], in which ships are divided into four classes based upon size and one class is given to samples without ship sounds (see Table I). Having a no-ship class enables the development of a flat classifier instead of using a multi-level classifier, which would utilize a detector of ship presence followed by the classification algorithm. For the task of multi-class classification, 30 second audio samples were assigned only one label indicating which class of ship was present based on AIS messages [12]. In contrast, for the multi-labeled dataset, 30 second audio samples with more than one ship present were labeled with the class labels of all ships present at that time. Ship class was determined by matching audio data segments based on timestamps with AIS messages. Specifically, broadcast Maritime Mobile Service Identity (MMSI) numbers (or International Maritime Organization (IMO) numbers where an MMSI number could not be found) allowed for finding precise ship details in an online ship database [21]. For both datasets, a ship was deemed present if within 20 km (10.7 NM) of the sensor; time periods where all ships were outside 30 km from the sensor were labeled as the no-ship class. For the multi-labeled dataset, samples with more than one ship present were labeled with the class labels of all ships present at that time within 20 km [12]. Only 33% (136,044 of 415,951 samples) of the dataset contained samples with more than one ship present, which is a common data imbalance in multi-label classification for audio [15].
Mel-log spectrograms were used as input features to the models [22] and are related to the linear-frequency spectrogram, i.e., a Short Time Fourier Transform (STFT) magnitude. They are obtained by applying a mel-filterbank over the STFT magnitude, which effectively summarizes frequency content with fewer dimensions [23]. The mel-filterbank emphasizes details in lower frequencies, which were proven to be important in underwater soundscape classification, and deemphasizes higher frequency content, which generally does not need to be modeled with high fidelity [13].

Specifically, in this work mel-log spectrograms were computed through a STFT of 30 second labeled samples with a 250 ms frame size, 75 ms frame hop, and a Hann window function [12]. We transform the STFT magnitude to the mel scale using a 128 band mel-filterbank followed by log compression of the signal [23]. Labeled samples for both tasks were further split into training/validation/test data with an 80/10/10 ratio, respectively.
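The feature pipeline above can be sketched in a few lines. The frame sizes follow the text; the triangular mel-filterbank construction and the use of scipy are illustrative choices, not necessarily the paper's exact implementation:

```python
import numpy as np
from scipy.signal import stft

def mel_filterbank(n_mels, n_fft, sr, fmin=0.0, fmax=None):
    """Triangular mel filterbank (a minimal sketch, not a librosa replica)."""
    fmax = fmax or sr / 2
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + np.asarray(f) / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (np.asarray(m) / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

sr = 4000                      # 4 kHz downsampled rate
frame = int(0.250 * sr)        # 250 ms frame -> 1000 samples
hop = int(0.075 * sr)          # 75 ms hop -> 300 samples
x = np.random.randn(30 * sr)   # stand-in for one 30 s labeled sample

# STFT magnitude with a Hann window
_, _, Z = stft(x, fs=sr, window="hann", nperseg=frame, noverlap=frame - hop)
mag = np.abs(Z)

# 128-band mel projection followed by log compression
mel_log = np.log(mel_filterbank(128, frame, sr) @ mag + 1e-10)
print(mel_log.shape)
```

Each 30 s sample thus becomes a 128-band time-frequency image that the CNNs consume.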
In Fig. 1 we visualize mel-log spectrograms for several examples, including having only a single target present and having two targets present at the same time. The color bar represents a dB down-scale. The US Navy's Generalized Digital Environmental Model (GDEM) product database provides global, gridded, steady-state ocean temperature and salinity profiles [24]. Monthly temperature and salinity profiles were extracted at the nearest GDEM point (32.5N 117.75W) to the HARP's location (32.666N 117.707W). These were used to derive SSPs [25] over the 12-month deployment period, see Fig. 2.
In order to evaluate the impact of environmental parameters on the performance of the trained models, two studies are conducted involving different data splits for the multi-class classification task, from December 2012 to March 2013 and from December 2012 to November 2013. From Fig. 2a, from December to March low dispersion of the SSPs is observed, while for the overall time segment in Fig. 2b dispersion between the SSPs is significantly increased.
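SSPs can be derived from such temperature and salinity profiles with a standard empirical formula. The sketch below uses Mackenzie's (1981) nine-term equation, one common choice; the paper's reference [25] may use a different formulation, and the example profile point is invented:

```python
def sound_speed_mackenzie(T, S, D):
    """Sound speed in seawater (m/s) via Mackenzie's (1981) nine-term equation.
    T: temperature (deg C), S: salinity (PSU), D: depth (m)."""
    return (1448.96 + 4.591 * T - 5.304e-2 * T**2 + 2.374e-4 * T**3
            + 1.340 * (S - 35.0) + 1.630e-2 * D + 1.675e-7 * D**2
            - 1.025e-2 * T * (S - 35.0) - 7.139e-13 * T * D**3)

# Example: an illustrative profile point near the sensor depth
c = sound_speed_mackenzie(T=8.0, S=34.0, D=700.0)
print(round(c, 1))
```

Evaluating such a formula over each monthly GDEM profile yields the depth-dependent SSP curves shown in Fig. 2.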

A. Bayesian Deep Learning
BDL combines the ability of Bayesian probabilistic models to provide uncertainty in predictions with the ability of neural networks to recognize patterns and relationships [17], [26]. Specifically, model uncertainty is measured by placing a prior probability distribution over the model's weights in order to construct a Bayesian CNN (BCNN) [27]. Given a supervised learning setting and a training dataset, D = {x_n, y_n}, n = 1, ..., N, where N represents the dataset size and x_n represents an input feature vector with label y_n, prediction for a new input x* is obtained by marginalizing over the posterior distribution of the weights w:

p(y* | x*, D) = ∫ p(y* | x*, w) p(w | D) dw (1)

In practice, because the inference defined in Eq. 1 is intractable due to the calculation of the probability distribution p(w | D), approximate inference is used.

In this work, we evaluate variational inference (VI) approaches [26], [28], [29] that approximate the posterior distribution p(w | D) with a variational distribution q_θ(w) parameterized by θ. The approximate distribution needs to be as close as possible to the posterior distribution. A common approach in information theory is to measure proximity (or similarity) between two distributions via Kullback-Leibler (KL) divergence [30]. Therefore, we minimize the KL divergence between the two distributions:

θ* = argmin_θ KL( q_θ(w) || p(w | D) ) (2)

Minimizing the KL divergence in Eq. 2 is equivalent to minimizing the negative evidence lower bound function (ELBO) [16], [28], [31] relative to θ:

L(θ) = -E_{q_θ(w)}[ log p(D | w) ] + KL( q_θ(w) || p(w) ) (3)

where the first term represents the expected likelihood, which "describes how the variational distributions of the neural parameters explain the observed data" [32], and the second term is the KL divergence measuring proximity between the posterior and prior densities. The cost function defined by Eq. 3 is minimized in a mini-batch stochastic gradient descent fashion during neural network training to find the optimal value of θ, which defines the parameters of the distribution over weights.
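For the Gaussian variational family and unit-normal prior adopted later in this section, the KL term of Eq. 3 has a closed form, which makes the negative ELBO easy to sketch. The 1/B mini-batch scaling of the KL term in `negative_elbo` is a common convention and an assumption here, not a detail stated in the text:

```python
import numpy as np

def kl_gaussian_std_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights."""
    return float(np.sum(0.5 * (sigma**2 + mu**2 - 1.0 - np.log(sigma**2))))

def negative_elbo(nll, mu, sigma, n_batches):
    """Mini-batch negative ELBO (Eq. 3): expected negative log-likelihood
    plus the KL term, scaled by the number of mini-batches."""
    return nll + kl_gaussian_std_normal(mu, sigma) / n_batches

# When q equals the prior, the KL term vanishes
print(kl_gaussian_std_normal(np.zeros(10), np.ones(10)))  # -> 0.0
```

Training then amounts to minimizing `negative_elbo` with respect to the variational parameters (mu, sigma) by stochastic gradient descent.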
In order to optimize the lower bound with respect to variational parameters θ and model parameters w, the expectation term in Eq. 3 must be differentiated with respect to both of these variables. The problem of computing the gradient of an expectation of a function with respect to the parameters of the distribution is addressed by Monte Carlo gradient estimation [33]. However, the standard Monte Carlo gradient estimator of the variational lower bound with respect to the variational parameters, θ, exhibits a level of variance that is too high to be practical for deep learning purposes [28], [34]. Proposed solutions are based on the reparameterization of q_θ(w) such that the samples generated from the reparameterized approximate posterior yield a lower variance [28], [30], [31]. Some of the well-established solutions that are part of major toolboxes such as Tensorflow Probability (TFP) [35] include techniques such as the local reparametrization trick (LRT) [30] and flipout estimators [31]. While both estimators present an improvement over standard variational inference, the flipout estimator, given several assumptions and a trade-off in computational complexity, can yield decorrelated stochastic gradient estimates on a mini-batch that exhibit less variance than the estimated mini-batch gradients computed via an LRT approach [31]. Intuitively, from an optimization perspective, decorrelation leads to better conditioning of the Hessian for updating the weights by whitening the external and internal network-layer inputs. In our evaluations we used a default TFP implementation of 2D Convolutional flipout layers, which assumes a Gaussian distribution with variational parameters θ = (µ, σ), variational distribution q_θ(w) = N(µ, σ²) and a prior with p(w) = N(0, 1).
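The core idea of flipout, decorrelating per-example weight perturbations with random sign vectors, can be illustrated for a dense layer in plain numpy. This is a sketch of Wen et al.'s estimator, not the TFP convolutional implementation used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def flipout_dense(x, w_mean, w_std):
    """Flipout forward pass for a dense layer. x: (batch, d_in).
    A single Gaussian perturbation dw is shared across the mini-batch;
    per-example sign flips r, s decorrelate it between examples."""
    batch, d_in = x.shape
    d_out = w_mean.shape[1]
    dw = w_std * rng.standard_normal((d_in, d_out))     # shared perturbation
    r = rng.choice([-1.0, 1.0], size=(batch, d_out))    # output-side signs
    s = rng.choice([-1.0, 1.0], size=(batch, d_in))     # input-side signs
    # x (W_mean + dw * r s^T) computed without materializing per-example weights
    return x @ w_mean + ((x * s) @ dw) * r

x = rng.standard_normal((4, 8))
w_mean = rng.standard_normal((8, 3))
y = flipout_dense(x, w_mean, w_std=0.1)
print(y.shape)
```

Because each example effectively sees a different weight sample, mini-batch gradient estimates are far less correlated than with a single shared sample.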
Given that a flipout estimator effectively doubles the number of parameters of the deterministic neural network layer, we evaluated a simpler method that does not explicitly model distributions over weights, called Monte Carlo (MC) Dropout [16], [27], [36], [37]. MC Dropout is based on a standard neural network regularization technique called dropout, which was introduced by Srivastava et al. [38]. Dropout is applied as a layer in neural networks where it zeros out a random subset of the weights in the preceding layer during training only. It was shown [16], [39] that the set of mean weight matrices (for an L layered neural network) and dropout probabilities (variational parameters) for a dropout distribution satisfies θ = {M_l, p_l}, l = 1, ..., L, such that q_θ(w) = Π_l q_{M_l}(w_l) for a single random weight matrix w_l of dimensions K_{l+1} by K_l. Further, the KL term of Eq. 3 can be approximated as shown in Eqs. 4 and 5:

KL( q_θ(w) || p(w) ) ≈ Σ_l [ ((1 - p_l)/2) ||M_l||² - K_{l+1} H(p_l) ] + B (4)

H(p) = -p log p - (1 - p) log(1 - p) (5)

where H(p) is the entropy of a Bernoulli random variable with probability p [39] and B is the constant term related to the number of mini-batches in optimization. The entropy term is constant with respect to the model weights and, given that it does not affect the optimization, can be omitted when the dropout probability is not optimized. To calculate the KL divergence, one therefore only needs to evaluate the dropout probability, p, and the squared L2 norm of the weights, ||M_l||².
It was shown by Gal and Ghahramani [16] that any neural network with standard dropout and L2 regularization is equivalent to variational inference with the assumption that the dropout is active in inference. Explicitly optimizing the dropout probability in the entropy term leads to a different technique named Concrete Dropout [39], which we have not evaluated in this work.
We point out that the log-likelihood term of the ELBO loss function, see Eq. 3, can either be calculated in an explicit manner or implicitly take the form of categorical cross-entropy for multi-class classification with a softmax activation function on top of the flipout and MC Dropout architectures [40]. Similarly, for multi-label classification, the log-likelihood term takes the form of binary cross-entropy with sigmoid activation. For readers interested in implementation details, Filos et al. [36] host a code repository that includes details of both flipout and MC Dropout methods, and we followed their implementation approach. Additionally, Google LLC hosts a code repository, see Nado et al. [41], with current state-of-the-art benchmarks in Bayesian Deep Learning, including methods presented in this manuscript.
For a trained neural network with weights ŵ_t, prediction uncertainty is induced by the uncertainty in the weights and can be calculated by marginalizing over the approximate posterior distribution q_θ(w) using Monte Carlo integration [16] with T samples to calculate the mean predictive probability from Eq. 1:

p̂_c = (1/T) Σ_{t=1}^{T} p(y* = c | x*, ŵ_t) (6)

where ŵ_t ∼ q_θ(w) and c represents the true class (e.g., "Class A" or "Class D"). Further, final classification is assigned based on Eq. 6 by assigning the class with the highest mean predictive probability. It is important to clarify that, in inference, prediction is repeated T times for each input to the trained neural network for both flipout and MC Dropout. Intuitively, with every prediction for the given input a different set of weights is sampled from the model and an ensemble of predictions is produced, with the final prediction being an ensemble average. For further clarification, in contrast to dropout for regularization during training, in the MC Dropout approach dropout is active with probability p during inference as well.
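Eq. 6 amounts to averaging softmax outputs over T stochastic forward passes. A minimal sketch follows, with a toy stochastic "network" (fixed logits masked by dropout-like noise) standing in for a dropout-active model:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_predict(logits_fn, x, T=50, n_classes=5):
    """Mean predictive probability (Eq. 6): average softmax over T
    stochastic forward passes. logits_fn must stay stochastic at inference."""
    probs = np.zeros(n_classes)
    for _ in range(T):
        z = logits_fn(x)              # one stochastic forward pass
        e = np.exp(z - z.max())       # numerically stable softmax
        probs += e / e.sum()
    return probs / T

# Toy stand-in: logits randomly zeroed with p=0.3 and rescaled, as in dropout
base = np.array([2.0, 0.5, 0.1, -1.0, -1.0])
noisy = lambda x: base * rng.choice([0.0, 1.0 / 0.7], size=5, p=[0.3, 0.7])

p_mean = mc_predict(noisy, None, T=50)
print(p_mean.argmax())
```

The final label is the argmax of `p_mean`; the spread of the T individual predictions carries the uncertainty information used below.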
Input mel-log spectrograms were classified from Eq. 6 by assigning the class with the highest mean probability, argmax_c p̂_c. The uncertainty of BCNN classifiers is quantified by predictive entropy and total variance [16], [36]. Predictive entropy is a well-established uncertainty metric that measures the average amount of information contained in the predictive distribution and is given by:

H_p = -Σ_{c=1}^{C} p̂_c log₂ p̂_c (7)

H_p can be normalized to fall between zero and one by dividing by log₂ C, where C is the number of classes [42], as shown in Eq. 8:

H̃_p = H_p / log₂ C (8)
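Eqs. 7 and 8 can be computed directly from the MC-averaged probabilities; the sketch below clips probabilities only to avoid log(0):

```python
import numpy as np

def normalized_predictive_entropy(p_mean):
    """Predictive entropy of the averaged distribution (Eq. 7),
    normalized to [0, 1] by log2 C (Eq. 8)."""
    p = np.clip(p_mean, 1e-12, 1.0)
    h = -np.sum(p * np.log2(p))
    return h / np.log2(len(p))

print(normalized_predictive_entropy(np.ones(5) / 5))               # maximally uncertain
print(normalized_predictive_entropy(np.array([1.0, 0, 0, 0, 0])))  # confident
```

A uniform distribution yields the maximum normalized entropy of 1, while a one-hot prediction yields (essentially) 0, which is what makes the metric usable as a triage threshold.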

B. Architecture Choices and Tasks
The popularity of CNNs is due to the state-of-the-art performance that these models achieve in large scale image recognition tasks [43]. A desire to train deeper neural networks, potentially improving performance further, led to the development of residual connections introduced by He et al. [18], who demonstrated the ability to train deeper convolutional models than previously possible. We utilize these ResNet architectures in parallel to the custom CNN architecture to train and develop deterministic and BDL models for classification.
In this work, we focus on two classification tasks, multi-class classification and multi-label classification. Multi-class classification assumes a multinoulli (categorical) probability distribution, since one wants to represent a distribution over c classes. This is typically achieved by having c neurons in the last layer of the neural network and applying a softmax activation function, p̂_{c,t} = softmax( f_t(x*) ), see Eq. 6.
Multi-label classification is typically formulated as multiple binary classification problems when using a negative log likelihood loss (cross-entropy loss) [44]. A similar approach was followed by Hershey et al. [15] for audio classification using c sigmoid activation functions (σ), one over each of the c output neurons:

p̂_t = σ( f_t(x*) ) (9)

where f_t(x*) ∈ R^c. This is a common approach in image multi-label classification [45], [46].
The choice of task drives the choice of activation function on the output of the neural network model; however, the number of output neurons is constant across the architectures.
Multiple labels can be predicted when the individual probabilities on the output neurons are greater than the probability threshold, in this work set at 0.5 [40], [44]. The threshold was not tuned to maximize any specific metric, such as the F_1 score, or to reduce the false-alarm rate. Since our ontology includes all ships and "no ship" as classes, in the case where none of the classes meet the threshold, we select the predicted class in the same manner as the multi-class classifier described above. In the case where the "no ship" class, Class E, and at least one other class both meet the threshold, if the probability for Class E is greater than those for any other class, the model predicts Class E and nothing else. If at least one of the ship classes has a probability greater than or equal to that for Class E, the model predicts every ship class that meets the threshold and not Class E.
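The decision rule just described can be expressed compactly. The sketch below follows our reading of the stated logic; the class ordering (Class E last) is an assumption for illustration:

```python
import numpy as np

CLASSES = ["A", "B", "C", "D", "E"]   # index 4 = Class E, the "no ship" class

def decide_labels(p, threshold=0.5):
    """Multi-label decision rule: threshold the sigmoid outputs, fall back to
    argmax if nothing passes, and resolve ship vs. 'no ship' conflicts."""
    above = [i for i in range(len(p)) if p[i] > threshold]
    if not above:                      # nothing passes: act as a multi-class model
        return [int(np.argmax(p))]
    e = len(p) - 1
    if e in above and len(above) > 1:  # Class E conflicts with ship classes
        ships = [i for i in above if i != e]
        if p[e] > max(p[i] for i in ships):
            return [e]                 # "no ship" dominates all ships
        return ships                   # otherwise keep ships, drop Class E
    return above

print([CLASSES[i] for i in decide_labels(np.array([0.7, 0.1, 0.6, 0.2, 0.1]))])
```

With two ship classes above threshold and Class E below it, both ship labels are emitted, which is exactly the multi-label behavior the multi-class formulation cannot provide.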
For ResNet [18] architectures, we tested standard ResNet32V1, ResNet20V1 and ResNet8V1 model configurations as deterministic baselines that we adapted to Bayesian configurations following suggestions in Tran et al. and Dillon et al. [35], [47]. Typically, CNNs that focus on image classification use square kernels of size 3x3 or 5x5, where larger kernels are considered inefficient due to computational requirements [40]. Assumptions about image orientation invariance do not transfer to spectrograms derived from time-series audio data [12], and multiple studies have explored the use of rectangular kernels in audio classification. Several studies use rectangular kernels in music classification [48]. Mars et al. use rectangular filters of various sizes to vary the convolution of the time and frequency domains [49]. In order to adopt these ideas, rather than adjusting a ResNet architecture, a custom CNN architecture is proposed, as shown in Fig. 3.
Through a hyperparameter search of kernel size ratios (1:1, 2:1, 3:1, and 4:1), using a kernel size of 5x5 as a baseline, it was found that a 2:1 ratio is optimal and a 4:1 ratio performed the worst [12], based on model accuracy as the metric. Kernel ratios of 2:1 and 3:1 performed almost identically (with an accuracy of 0.89), where a 4:1 ratio had a drop in performance (accuracy of 0.87), with a 1:1 kernel in the middle (accuracy of 0.88). We chose 2:1 over 3:1 as optimal given that it introduces a lower number of parameters to the network architecture.
Hence, the final proposed kernels of 10x5 were used to apply the convolution operation on time vs frequency mel-log spectrograms. Given our non-exhaustive ablation study for the kernel ratios, we did not change the ResNet architectures and left them with the original isotropic kernel configurations of 3x3 as a fair and unbiased comparison throughout the manuscript. These kernel sizes are fixed throughout every layer of the network. A batch normalization layer is used to normalize input mel-log spectrograms. Based on work by Ozyildirim and Kartal [50], an increasing number of filters is used throughout the network. The initial layers contain 16 filters, with 16 added in each additional set. After each block of two convolutional layers, the input size to the next block is cut in half by a max pooling layer with a stride of 2 by 2. L2-regularization on CNN layers is used to prevent overfitting [12] and to satisfy MC Dropout, see Eq. 5.
Fig. 3. Proposed Model Architecture: Each block is described by (number of filters, filter shape) [12].
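The way feature-map sizes and filter counts evolve through the proposed blocks can be traced without any deep learning framework. The input shape (128 mel bands by roughly 400 STFT frames for a 30 s sample) and the block count below are illustrative assumptions, not the paper's exact configuration:

```python
def custom_cnn_shapes(input_shape=(128, 401), n_blocks=4):
    """Trace feature-map shapes through the proposed architecture sketch:
    blocks of two 'same'-padded conv layers (10x5 kernels, so spatial size is
    preserved) with 16 filters added per block, each followed by 2x2 max pooling."""
    h, w = input_shape
    shapes = []
    for b in range(1, n_blocks + 1):
        filters = 16 * b        # 16, 32, 48, ... filters per block
        h, w = h // 2, w // 2   # 2x2 max pool halves each spatial dimension
        shapes.append((h, w, filters))
    return shapes

for s in custom_cnn_shapes():
    print(s)
```

This makes explicit the trade the architecture makes: spatial resolution shrinks by half per block while channel depth grows linearly.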

C. Training Protocols
All of the model architectures evaluated were developed using the same training strategy for fairness of benchmarking. Adam optimization was used with a starting learning rate of 0.001 and default values for the beta parameters (0.9, 0.999). We employed learning rate annealing [51] via monitoring validation accuracy, such that the learning rate was reduced by a factor of 10 if the validation accuracy was not increasing for 50 consecutive epochs. To regularize for overfitting, an early stopping strategy was utilized [40]. Overall, training was terminated at 500 epochs.
An MC Dropout model was used with a dropout probability of 0.3. Using a non-exhaustive grid parameter search, the best observed L2 regularization on convolutional layers was determined to be 0.001, and this value was used throughout all of the deterministic and MC Dropout layers. This is similar to Filos et al. [36]. For the flipout and initialization of the posterior, we followed TFP [35] recommendations for Bayesian ResNet architectures and initialized convolutional kernel posteriors with µ = -9.0 and σ = 0.1. NVIDIA RTX 8000 48GB GPU graphics cards were used for distributed model training and model inference.

A. Metrics
In traditional binary classification tasks, standard metrics such as accuracy (Acc), precision (Prec), recall (Rec), F_1 score, and area under the Receiver Operating Characteristics (ROC) curve are used to evaluate performance [40]. These metrics can also be extended to multi-class classification tasks in a fairly straightforward manner, since there is still only one label per sample. The performance is calculated per class and then averaged across all classes. This technique is known as macro-averaging. Averages can also be weighted by the number of instances of each label in the dataset, which is especially useful in the case of imbalanced datasets [52].
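The difference between macro- and weighted-averaging is easy to see numerically. The per-class F_1 values and supports below are invented for illustration:

```python
import numpy as np

def macro_and_weighted(per_class_f1, support):
    """Macro average weights every class equally; the weighted average scales
    each class's score by its share of the samples (support)."""
    per_class_f1 = np.asarray(per_class_f1, dtype=float)
    support = np.asarray(support, dtype=float)
    macro = per_class_f1.mean()
    weighted = float(np.sum(per_class_f1 * support / support.sum()))
    return macro, weighted

# A rare, poorly-classified class drags the macro average down far more
# than the weighted one
macro, weighted = macro_and_weighted([0.9, 0.9, 0.3], support=[1000, 1000, 50])
print(round(macro, 3), round(weighted, 3))
```

Reporting both, as done later in this paper, therefore exposes whether a model's headline score is propped up by the majority classes.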
The ability of a single test instance to be associated with multiple labels simultaneously, however, makes multi-label performance evaluation much more complex than in the traditional single-label learning environment [53]. With multi-label models, micro-averaging is possible using the total number of true and false positives (TP and FP, respectively) and true and false negatives (TN and FN, respectively) to calculate the average globally. It is also important to note that any of the multi-label metrics discussed below can be used to describe multi-class performance by treating the multi-class dataset as a multi-label one for which there happen to be no multi-label instances (micro-averaging for all multi-class metrics is equivalent to calculating accuracy).
There are two general categories of multi-label metrics: label-based metrics and instance-based (also called example-based or sample-based) metrics [53], [54]. Label-based metrics evaluate the machine learning model separately for each class label and then return either the micro- or macro-averaged value across all class labels, Eqs. 10 and 11. Given a testing dataset, where M represents the test dataset size, x*_i is the i-th feature vector (i.e., the i-th test sample) and Y_i is the set of true labels associated with the i-th test sample, the values of TP, FP, TN and FN with respect to the j-th class label, y_j, are calculated as:

TP_j = Σ_i [[ y_j ∈ Y_i and y_j ∈ f(x*_i) ]], FP_j = Σ_i [[ y_j ∉ Y_i and y_j ∈ f(x*_i) ]],
TN_j = Σ_i [[ y_j ∉ Y_i and y_j ∉ f(x*_i) ]], FN_j = Σ_i [[ y_j ∈ Y_i and y_j ∉ f(x*_i) ]] (10)

where [[q]] returns 1 if predicate q is true and 0 otherwise, and f(x*_i) is the set of predicted labels output by the model f (deterministic or Bayesian) on x*_i. Eq. 11 then describes how to use these values to compute traditional performance metrics using either macro- or micro-averaging, where B ∈ {Acc, Prec, Rec, F_1} and C is the number of classes:

B_macro = (1/C) Σ_{j=1}^{C} B(TP_j, FP_j, TN_j, FN_j), B_micro = B( Σ_j TP_j, Σ_j FP_j, Σ_j TN_j, Σ_j FN_j ) (11)

Similarly, a macro-averaged area under the ROC curve (AUC) can be calculated by first computing the AUC for every class in a "one-vs-rest" manner and then averaging over C [53].
Since each instance for which the model makes predictions can be associated with more than one label, instance-based performance metrics can be aggregated by evaluating each test example individually and then averaging across the whole test set (in the multi-class case, the label-based and instance-based calculations are the same). For multi-label accuracy, we use subset accuracy as defined in Eq. 12, where [[q]] returns 1 if predicate q is true and 0 otherwise:

Acc_subset = (1/M) Σ_{i=1}^{M} [[ f(x*_i) = Y_i ]] (12)

This equates to the proportion of samples where the set of predicted labels for each sample, f(x*_i), exactly matches the set of true labels, Y_i, for the sample. This measure is intuitively the counterpart to traditional accuracy (the proportion of samples a model got "correct") and is the strictest measure of multi-label accuracy [53].
The instance-based methods of calculating the other traditional performance metrics are listed in Eq. 13. Precision and recall are calculated for each instance by taking the size of the intersection of the set of true labels, Y_i, and the set of predicted labels, f(x*_i), divided by the size of the set of predicted labels or the size of the set of true labels, respectively:

Prec_inst = (1/M) Σ_{i=1}^{M} |Y_i ∩ f(x*_i)| / |f(x*_i)|, Rec_inst = (1/M) Σ_{i=1}^{M} |Y_i ∩ f(x*_i)| / |Y_i| (13)

F_1 is then calculated in the usual way using the instance-based precision and recall.
The final metric we discuss here is Hamming loss (HL), which evaluates the fraction of labels that are incorrectly predicted, that is, a relevant label is not predicted or an irrelevant label is predicted [53]. For each test instance, HL is the size of the symmetric difference, Δ (equivalent to XOR in Boolean logic), between the set of predicted labels, f(x*_i), and the set of true labels, Y_i, divided by the number of classes, C [55]:

HL = (1/M) Σ_{i=1}^{M} |f(x*_i) Δ Y_i| / C (14)

The individual instance HLs are averaged across all instances, as in Eq. 14, and lower values indicate better performance. In the multi-class case, HL is equivalent to 1 - Acc.
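The instance-based metrics of Eqs. 12-14 can be implemented directly over sets of labels. This is a sketch: the text leaves open whether instance-based F_1 is averaged per instance or computed from the averaged precision and recall, and we use the latter; the example label sets are invented:

```python
def multilabel_metrics(true_sets, pred_sets, n_classes):
    """Instance-based multi-label metrics (Eqs. 12-14), averaged over M instances."""
    M = len(true_sets)
    subset_acc = prec = rec = ham = 0.0
    for Y, P in zip(true_sets, pred_sets):
        inter = len(Y & P)
        subset_acc += float(Y == P)           # Eq. 12: exact-match indicator
        prec += inter / len(P) if P else 0.0  # Eq. 13 precision term (sets are
        rec += inter / len(Y) if Y else 0.0   # never empty under this ontology)
        ham += len(Y ^ P) / n_classes         # Eq. 14: symmetric difference
    subset_acc, prec, rec, ham = (v / M for v in (subset_acc, prec, rec, ham))
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"subset_acc": subset_acc, "precision": prec,
            "recall": rec, "f1": f1, "hamming_loss": ham}

true = [{"D"}, {"A", "C"}, {"E"}]
pred = [{"D"}, {"A"}, {"B"}]
print(multilabel_metrics(true, pred, n_classes=5))
```

Note how the second instance is wrong under subset accuracy but still earns partial credit under precision, recall, and Hamming loss, which is why all of these metrics are reported side by side.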
In this paper, in order to give a broad picture of the comparative performance of the several models tested, for both multi-class and multi-label models we report the macro-averaged and weighted-averaged label-based precision, recall, and F_1 score as calculated in Eq. 11, as well as the macro-averaged AUC and the HL (see Eq. 14). For multi-label performance, we also report label-based micro-averaged and instance-based precision, recall and F_1 score as shown in Eqs. 11 and 13. Accuracy is reported as discussed in the examination of Eqs. 11 and 12 above.

B. Performance Overview and Comparison
Our experiments examined the performance of traditional deterministic (non-Bayesian), MC Dropout, and flipout versions of both our custom CNN and ResNet models. For the ResNet models, the ResNet32V1 versions consistently performed better than the other ResNet configurations tested. For example, weighted average F_1 scores for the multi-label MC Dropout versions of the model were 0.78, 0.76, and 0.71 for ResNet32V1, ResNet20V1 and ResNet8V1, respectively.
Because this general pattern held across all versions, only the ResNet32V1 results are reported here. We trained models of each of the versions on the full HARP dataset for multi-class and multi-label classification, respectively. The results for multi-class classification are reported in the corresponding results table. In general, models perform best on Classes D and E and worse on Classes A, B, and C.
Intuitively, given that Class E represents the "no ship" class and Class D encompasses the largest targets, this might not be a surprising result. Classes A, B, and C are more challenging for the models to distinguish and perform well on. We speculate that the main reason behind this could be the similarity of some target signatures across the three classes and large intra-class signature variation (see the ontology in Table I). Fig. 5 illustrates a key benefit of the use of BCNNs as compared to traditional CNNs: the ability to get uncertainty measurements on each prediction. Not only does this value give us more insight into the performance of our model, it can also be used to triage only the most ambiguous classifications for further analysis by experts [36]. Examining the multi-class MC Dropout model in Fig. 5, filtering out the most uncertain predictions removes a portion of the samples but improves the weighted F_1 score from 0.82 to 0.90. For the multi-label MC Dropout model, using the same filter results in retaining 86% of the data and increases the weighted F_1 score from 0.84 to 0.88. Thus the model is able to achieve significantly higher performance on a relatively large portion of the original dataset by removing the samples it is most uncertain about.
These samples can then be put in a queue for further expert analysis. In crowded hydroacoustic environments, prioritizing samples by their uncertainty for analysis by sonar operators can enable ships, submarines, or shore monitoring stations to more efficiently process and categorize sonar contacts and more effectively allocate the scarce resources of operator time and attention.
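The entropy-based triage described above can be sketched as follows. This is a minimal illustration, not our evaluation pipeline: the arrays are toy data and the helper name is hypothetical, though the 0.375 threshold matches the value used in the text.

```python
import numpy as np
from sklearn.metrics import f1_score

def filter_by_uncertainty(y_true, y_pred, entropy, threshold):
    """Discard predictions whose predictive entropy exceeds the threshold;
    return the retained fraction and the weighted F1 score on what remains."""
    keep = entropy <= threshold
    retained = keep.mean()
    f1 = f1_score(y_true[keep], y_pred[keep], average="weighted")
    return retained, f1

# Toy example: confident (low-entropy) predictions happen to be correct,
# while the high-entropy ones are mostly wrong.
y_true  = np.array([0, 1, 2, 0, 1, 2, 0, 1])
y_pred  = np.array([0, 1, 2, 0, 1, 0, 2, 0])
entropy = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.8, 0.7])

retained, f1 = filter_by_uncertainty(y_true, y_pred, entropy, threshold=0.375)
# Retains the five low-entropy samples, all of which are classified correctly.
```

The rejected samples (here the last three) are exactly those that would be queued for expert review.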
Mean predictive probability, Eq. 6, is calculated through Monte Carlo integration with T samples. We evaluate the consistency of the prediction relative to the number of samples for a custom model in both dropout and flipout configurations (see Fig. 4). In this work we chose T = 50, given that it provided consistent predictive probability and was comparable to the sample sizes reported in the literature. For example, Filos et al. [36] used T = 100 when comparing dropout and flipout methods in medical applications.
We recognize that the number of Monte Carlo samples required to achieve consistent predictions is affected by numerous factors, including the underlying data distribution, the model architecture, and the choice of Bayesian method (e.g., flipout, dropout, reparametrization), to name a few. In practice, especially in live inference applications, one will likely choose a smaller sample size after conducting an analysis similar to the one above. This involves a likely trade-off between computational and/or power requirements on one hand and consistency of prediction and/or calibration of uncertainty on the other.
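As a sketch of the Monte Carlo integration above, the mean predictive probability amounts to averaging T stochastic forward passes. The toy stochastic model below is only a stand-in for a dropout-enabled network (it is not our architecture); it illustrates that each pass yields a valid probability vector and that larger T lowers the variance of the estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_forward(x, n_classes=3):
    """Stand-in for one stochastic forward pass (e.g., with dropout active):
    returns a noisy softmax vector centred on a fixed class profile."""
    logits = np.array([2.0, 0.5, -1.0]) + rng.normal(0.0, 0.5, n_classes)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def mc_predictive_probability(x, T):
    """Monte Carlo estimate of the mean predictive probability (cf. Eq. 6):
    average the class probabilities over T stochastic forward passes."""
    return np.mean([stochastic_forward(x) for _ in range(T)], axis=0)

p_small = mc_predictive_probability(None, T=5)
p_large = mc_predictive_probability(None, T=50)
# Both estimates sum to one; the T=50 estimate has lower variance.
```

Repeating the T = 50 call with different seeds would show much smaller run-to-run variation than the T = 5 call, which is the consistency criterion used to choose T.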
Both the MC Dropout and flipout Bayesian architectures produce calibrated uncertainties (see Table II and Table III). Given that the MC Dropout models have significantly fewer parameters and take less computational time than their flipout counterparts, MC Dropout is a promising option for sonar classification, especially for use on unmanned vehicles and other embedded systems. By producing calibrated uncertainties, we also confirm that the uncertainty formulation used, Eq. 8, suffices for our purposes in spite of additional uncertainty from approximation errors (e.g., Monte Carlo gradient estimation).
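The predictive uncertainty used throughout can be computed directly from the Monte Carlo mean probabilities. The sketch below assumes H*p is the predictive entropy normalized by its maximum, log K for K classes (an assumption about the normalization in Eq. 8, made so that a single filtering threshold applies regardless of the number of classes):

```python
import numpy as np

def predictive_entropy(p, normalize=True, eps=1e-12):
    """Entropy of the mean predictive distribution p (shape: n_classes).
    With normalize=True the value is divided by log(K), mapping it to
    [0, 1] so one threshold can be reused across ontologies."""
    p = np.clip(p, eps, 1.0)
    h = -np.sum(p * np.log(p))
    return h / np.log(len(p)) if normalize else h

uniform = np.ones(5) / 5                # maximally uncertain prediction
onehot  = np.array([1.0, 0, 0, 0, 0])   # maximally certain prediction

h_max = predictive_entropy(uniform)     # -> 1.0
h_min = predictive_entropy(onehot)      # -> ~0.0
```

Under this convention, the 0.375 threshold discussed earlier corresponds to discarding the most ambiguous fraction of the normalized entropy range.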

C. Case Studies and Analysis
To explore how all of these results fit together in practice, we selected a ship and examined in detail the predictions our model made as that ship transited past the HARP sensor; we call this Case Study I. The ship selected was a car carrier which sailed near the HARP sensor over roughly a four-hour period on February 25, 2013. We used the corresponding AIS information to build a range versus time plot and examined the model's predictive output and uncertainty along the track, as shown in Fig. 6. As the ship approached, the model made several erroneous predictions and had high uncertainty until a range of roughly 15 km. With decreasing range, more sound information began to reach the sensor, making the predictions both more accurate and more certain. As the ship passed its closest point of approach and began opening range, we saw the same effect, with less accurate and less certain predictions farther from the sensor.
The model predictions and uncertainties also capture two other hydroacoustic phenomena. The first is the bow-null-effect acoustic shadow zone, which occurs when the radiated engine and propeller noise (most often the main source of ship noise, generated towards the aft end of the ship) is partially blocked by the hull of an approaching ship [56]. When the ship is moving away, the stern is exposed and the noise reaches the sensor unimpeded. This is seen most clearly in the large uncertainties and missed predictions on the approaching track, which persist until the range closes to about 12 km. In contrast, on opening range, the uncertainties do not consistently rise until roughly 17 km, which is also where we observe the first missed prediction. Uncertainties also show a small spike as the ship passes very close to the sensor, revealing the second phenomenon: interference and acoustic bleed-over caused by the high sound pressure levels reaching the sensor.
The observation of these two phenomena, as well as the better performance at closer ranges, in Case Study I matches what would be expected in real-world conditions and demonstrates the model's ability to reflect reality in its performance and uncertainty measurements.
As another case study, we trained additional versions of our custom model on a seasonal subset of the data, preserving the training strategy reported in Section III-C. In Case Study II, we compared the performance of our general multi-class models trained on the whole year's data (see Table II) to the performance of models trained only on data from the winter months (December to March). The results of cross-testing both sets of models on the large and small datasets are summarized in Table VI. While the models trained only on the winter data had excellent performance on a held-out test set from the winter, their predictive power did not generalize, with poor performance on the full year's data. The models trained on the whole dataset, in contrast, were unable to reach the peak performance of those trained on the smaller dataset, but performed much better across the whole range of data. They also lose only a small amount of their overall performance when making predictions on the smaller dataset.
These results demonstrate the effects of seasonal variation on model performance, as well as the diverse set of underlying distributions required to train a generalizable classifier for underwater soundscapes. As seen in Fig. 2, seasonal changes to the water column's SSP affect how the noise from even the same ship on the same track is received by the sensor, depending on the time of year. Datasets based on samples collected over a relatively brief period are therefore likely to be biased, which decreases the practical utility of models trained on them, even if those models seem to perform well. In order to capture all of these differences, datasets with samples from throughout the year, and ideally across multiple years, are needed. Another approach could be to train multiple models with data from different years but the same season, thus creating several "seasonal expert" classifiers.

Fig. 1. Examples of mel-log input features from the dataset. The color bar represents the dB-down scale.

Fig. 2. Monthly sound speed profiles at the nearest GDEM point (32.5N 117.75W) to the HARP's location (32.666N 117.707W). In 2a, sound speed profiles are illustrated for the Dec-Mar time frame, and in 2b, sound speed profiles are illustrated for the Dec-Nov months. Significant dispersion can be observed in 2b relative to 2a.

It can be observed that in both Bayesian configurations the weighted F1 score becomes consistent within T = 10 Monte Carlo samples for flipout and T = 20 Monte Carlo samples for dropout (see Fig. 4).

Fig. 4. Consistency of the prediction, in terms of weighted F1 score, as a function of Monte Carlo sample size T.


Fig. 5. Each point represents the model's weighted average F1 score vs. the ratio of the overall number of samples retained when filtering out all samples above a given predictive uncertainty value (measured by H*p).

Fig. 6. Case Study I: February 25, 2013, car carrier; range vs. time plot of a known target's AIS track overlaid with BCNN classification output and predictive entropy, Hp. Classification outputs are color coded by class, with the car carrier class (Class D) correctly classified in magenta. Hp is scaled such that the size of the gray transparent circle corresponds to the percentile of the value range of Hp, with smaller circles corresponding to lower Hp values and thus higher certainty. Predictions and Hp are from the custom multi-class MC Dropout BCNN trained on the full HARP dataset (see the third column, Custom/Drop, in Table II).
The classification of underwater sounds is of great interest to the community and has seen significant progress with the proliferation of deep learning. This work addressed several challenges of using deep learning models for classification in acoustically heterogeneous environments and effectively establishes benchmark performance for both the multi-label and multi-class classification of underwater soundscapes. Additionally, this is a first study demonstrating the quality and utility of the uncertainty of neural network classification with a ship-based ontology on underwater soundscapes. Our best performing Bayesian model developed for the multi-label task achieves a weighted F1 score of 0.84, and the model developed for the multi-class task achieves a weighted F1 score of 0.82. In both tasks, the models simultaneously offered a measurement of uncertainty for each classified sample. This was achieved by adopting Bayesian deep learning, a new and emerging field, which can have significant implications for the deployment of deep learning models in production. The presented analysis is universal and applicable to any other classification or regression task in soundscape monitoring. The demonstrated results and analysis of uncertainty in the first case study correlated well with a physical understanding of sound propagation. Moreover, in the second case study we demonstrated the ability to preserve model performance under the seasonal variation of underlying sound speed profiles, enabled by training on one of the largest datasets used in this type of analysis. Overall, the proposed approaches can have significant impact on the autonomous monitoring of ocean resources through passive sonar.

The results for multi-label classification are in Table III. In both tasks, the models based on the custom CNN outperformed their ResNet counterparts in every metric. In most cases, the best performing version of the custom CNN was the MC Dropout BCNN, generally reflective of the performance enhancements seen in ensemble learning (MC Dropout can be viewed as an ensemble classifier where each inference is a different model) [17]. The exception was multi-label classification, where the deterministic and MC Dropout versions performed almost identically, varying by at most 2% in any one metric. We speculate, based on the uncertainty measurements discussed below, that the greater predictive uncertainties involved in multi-label classification offset the performance gains often seen with ensemble learning by introducing enough variation that the MC Dropout model was unable to make more accurate predictions than the deterministic model.

TABLE II: MULTI-CLASS PERFORMANCE METRICS

TABLE III: MULTI-LABEL PERFORMANCE METRICS

To further examine the performance of multi-label classification, we present per-class analysis in Table IV for both the custom CNN and ResNet32V1 model architectures in Bayesian configurations. Consistent with the previous analysis, we observe better performance with the custom CNN MC Dropout BCNN model. Table V presents the same information for multi-class classification and demonstrates the same broad trends discussed below.

TABLE VI: CASE STUDY II - CROSS-COMPARISON OF THE MULTI-CLASS PERFORMANCE OF MODELS TRAINED ON THE FULL DATASET VS. MODELS TRAINED ON A SEASONAL SUBSET. ALL MODELS ARE BASED ON OUR CUSTOM ARCHITECTURE. THE LARGE DATASET CONSISTS OF SAMPLES FROM A FULL YEAR, FROM DECEMBER 2012 TO NOVEMBER 2013. THE SMALL DATASET IS A SUBSET OF THE LARGER, CONSISTING ONLY OF THE SAMPLES FROM DECEMBER 2012 TO MARCH 2013.