Novel Low Memory Footprint DNN Models for Edge Classification of Surgeons’ Postures

Skill assessment is fundamental to enhancing current laparoscopic surgical training and to reducing the incidence of musculoskeletal injuries from performing these procedures. Recently, deep neural networks (DNNs) have been used to improve human posture and surgeons' skills training. While they work well in the lab, they normally require significant computational power, which makes them impractical for edge devices. This letter presents two low memory footprint DNN models for classifying laparoscopic surgical skill levels at the edge. Trained models were deployed on three Arm Cortex-M processors using the X-CUBE-AI and TensorFlow Lite Micro (TFLM) libraries. Results show that the X-CUBE-AI-based models give the best relative performance, memory footprint, and accuracy tradeoffs when executed on the Cortex-M7.


I. INTRODUCTION
WHILE minimally invasive surgery (MIS) has significantly improved patient outcomes, it puts surgeons at high risk of experiencing musculoskeletal disorders (MSDs) due to the requirement to spend extended periods performing repetitive movements while maintaining awkward postures [1]. This growing risk has been highlighted in many studies with warnings of an impending epidemic, with >80% of at-risk surgeons reporting significant pain during surgery [2]. Our pioneering work [3] has clearly demonstrated significant differences when body mass indexes (BMIs) exceed 40 kg m−2 and highlighted the need for better assessment and monitoring tools to help surgeons improve their posture during training. Two recent systematic reviews [4], [5] stressed the limits of traditional skill assessments and showed how deep neural network (DNN) models developed for laparoscopic surgical phase recognition have frequently reported accuracy rates above 90%. Generally, these assessments focus on the instrument mechanics or the kinematics of the hands/wrists of surgeons; this is easily performed in the lab (e.g., [6] or [7]), but it cannot be used in live surgery or realistic training scenarios due to the strict sterility requirements, as fixing sensors under the gloves could compromise a surgeon's performance (and, as a consequence, increase the risk for the patient) or distort the data. There is also a need for a truly portable solution, which requires data analysis and classification to be performed at the edge on the sensors themselves rather than transmitting all raw data. The majority of DNN models, however, demand substantial resources (e.g., memory footprint), which restricts their adoption in resource-constrained edge devices such as wearables. To enable the deployment of DNN models in these devices, different software libraries and application programming interfaces (APIs) are being proposed [8], [9].
They provide a set of functions and kernels devoted to optimizing the performance and minimizing the memory footprint of DNN models, thus allowing their efficient execution on edge devices that integrate low-power microprocessors and sensors. Although performing DNN inference at the edge improves latency and preserves privacy, the deployment of such models in resource-constrained devices is a challenging and time-consuming task, which must consider the tradeoff between power consumption, memory footprint, response time, and accuracy of the model [10]. These metrics vary depending on the processor type. Therefore, finding an efficient architecture that produces accurate results and meets real-time constraints is vital.
The objective of our work is to investigate the capability of machine and deep learning models for the classification of a surgeon's skill level by focusing on the movements of just the upper arm. For this, we need to establish the viability of a microcontroller-based surgical posture classifier to enhance current laparoscopic surgical training methods by taking advantage of a more self-contained microcontroller-based skills classification system. In this regard, the main contribution of this letter is twofold: first, the adaptation for operation on resource-constrained systems of two DNN models for classifying the skill level of laparoscopic surgeons; second, an extensive relative performance and accuracy analysis considering the deployment of the proposed models on different Arm Cortex-M microprocessors, which are widely adopted in smart sensor systems. The conducted analysis also compares the relative performance of the proposed models when converted using the X-CUBE-AI [8] and TensorFlow Lite Micro (TFLM) [9] libraries to determine which is the most effective.
II. RELATED WORKS

Table I summarizes the works on DNN-based surgical skill and posture classification. Except for the proposed convolutional neural network (CNN) model, which shows an accuracy of 86.3%, the remaining models reported accuracy rates above 90%, corroborating previous evidence [5]. The relative accuracy of the models varied between the datasets, which are appropriate and light enough to be handled by DNN models running either on general-purpose [6], [11], [12] or resource-constrained processors. All datasets consisted of data collected from at least eight participants [11], [12].

Except for [13] and this work, the reviewed approaches do not contemplate resource-constrained devices, but rather the deployment of CNN models on high-performance processors [6], [11], [12]. Licciardo et al. [13] proposed a custom hardware design of a CNN to classify sitting and lying postures; deployed on a field-programmable gate array (FPGA), it clearly outperforms the software solutions [6], [11], [12]. In contrast, this work considers the deployment of an artificial neural network (ANN) and a CNN posture classifier model for laparoscopic surgical skills training. On the memory footprint side, the presented ANN and CNN models provide significant memory savings (MSs) (i.e., < 1 kB) w.r.t. the other models, which require at least 5.80 kB. This letter is also distinguished from previous works by benchmarking both models on three different microprocessors, taking into account memory saving, accuracy, and performance efficiency.

III. ANN AND CNN POSTURE CLASSIFIER MODELS

A. Description of Developed Classifier Models
The ANN contains four layers in total: the input layer (126 values for the surgical dataset and 561 for the UCI dataset), two dense layers consisting of 400 and 200 neurons, respectively, and the output layer, as shown in Fig. 1. Each dense layer is followed by a batch normalization, ReLU, and dropout layer. A softmax layer is used as the last activation function to normalize the output to a probability distribution over the six predicted output classes; for the surgical dataset, these are three levels of skill and two BMI densities.
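As a concrete illustration, the layer stack above can be sketched with the Keras API (the framework named in Section III-B); the dropout rate is an assumption, as the letter does not state it:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_ann(input_size=126, n_classes=6, dropout=0.5):
    """Sketch of the four-layer ANN: input, two dense blocks
    (400 and 200 neurons, each followed by batch normalization,
    ReLU, and dropout), and a softmax output layer.
    The dropout rate is an assumed value."""
    return models.Sequential([
        layers.Input(shape=(input_size,)),
        layers.Dense(400),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Dropout(dropout),
        layers.Dense(200),
        layers.BatchNormalization(),
        layers.ReLU(),
        layers.Dropout(dropout),
        layers.Dense(n_classes, activation="softmax"),
    ])

# Surgical dataset variant (126 inputs); pass input_size=561 for UCI.
model = build_ann()
```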
The CNN is composed of four 1-D convolutional blocks, with 32, 64, 128, and 256 kernels of size 7, 5, 5, and 3, respectively. The kernel weights matrix is initialized using the He uniform function, and each convolutional layer has a stride size of 1 with the same padding. The four convolutional layers are each followed by a max-pooling layer, then batch normalization, ReLU, and dropout layers similar to the ANN. After the final convolutional block, there is a global average pooling operation and a dense layer, followed by a softmax layer that feeds the output. For the surgical dataset, the input size is 256 × 18 and for the UCI dataset, the input size is 33 × 17.
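Likewise, a minimal Keras sketch of this CNN block structure; the pooling size and dropout rate are assumptions not given in the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(256, 18), n_classes=6, dropout=0.5):
    """Sketch of the 1-D CNN: four conv blocks (32/64/128/256 kernels
    of size 7/5/5/3, stride 1, same padding, He-uniform init), each
    followed by max pooling, batch normalization, ReLU, and dropout,
    then global average pooling, a dense layer, and softmax.
    Pool size and dropout rate are assumed values."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters, k in [(32, 7), (64, 5), (128, 5), (256, 3)]:
        model.add(layers.Conv1D(filters, k, strides=1, padding="same",
                                kernel_initializer="he_uniform"))
        model.add(layers.MaxPooling1D(pool_size=2))
        model.add(layers.BatchNormalization())
        model.add(layers.ReLU())
        model.add(layers.Dropout(dropout))
    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(n_classes))
    model.add(layers.Softmax())
    return model

# Surgical variant (256 x 18); pass input_shape=(33, 17) for UCI.
model = build_cnn()
```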

B. Classifier Models Training
The laparoscopic training dataset used in this work contains motion data obtained from ten participants performing a standard laparoscopic threading task commonly used during surgical skills training [14]. The University of California, Irvine (UCI) human activity recognition dataset contains data obtained from an accelerometer and a gyroscope worn on the waist by 30 volunteers performing six activities, such as standing and lying down [15].
Both the ANN and CNN models are built using the Keras API and trained using Python version 3.7.10 with TensorFlow version 1.15.1 and Keras version 2.2.4. Post-training quantization of the models is performed using the TFLite Converter, with the input and output types both specified as 8-bit integers. The scaling and bias are determined from the test dataset corresponding to the model. The training parameters were as follows. ANN: learning rate, 0.001; batch size, 32; epochs, 100-200. CNN: L2 regularization, 0.001; batch size, 32; epochs, 200.
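The int8 conversion maps floating-point values to integers through a scale and a zero point derived from the observed data range, which is what the converter infers from the calibration data. The NumPy sketch below illustrates that affine scheme; the calibration values are made up for illustration:

```python
import numpy as np

def quantize_int8(x, scale, zero_point):
    # Affine quantization: q = round(x / scale) + zero_point,
    # clipped to the int8 range [-128, 127].
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize_int8(q, scale, zero_point):
    # Inverse mapping: x ~= (q - zero_point) * scale.
    return (q.astype(np.float32) - zero_point) * scale

# Derive scale/zero-point from the range of a (made-up) calibration set,
# analogous to what the converter infers from representative data.
calib = np.array([-1.0, 0.25, 0.5, 2.0], dtype=np.float32)
scale = (calib.max() - calib.min()) / 255.0
zero_point = int(round(-128 - calib.min() / scale))

q = quantize_int8(calib, scale, zero_point)
x_hat = dequantize_int8(q, scale, zero_point)
# Round-trip error stays within one quantization step (= scale).
```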

C. Method and Experimental Setup
In total, 24 test cases are considered (2× datasets, 2× network types, 2× implementation APIs, and 3× boards). The initial application for each test case was generated using STM32CubeMX (version 6.3.0, GCC version 9.3.1) based on the X-CUBE-AI (version 6.0.0, TensorFlow version 2.3.1) "System Performance" template. This template includes logging functionality that provides layer-by-layer performance analysis; it was modified to receive input data over a serial connection. Each application was compiled with the release configuration and tested against the full validation set of its respective dataset (628 samples for the surgical ANN model, 2147 for the surgical CNN model, and 2947 for the UCI models). All results in this work are averages over the entire dataset.
The test boards, shown in Table II, were configured at their maximum clock frequency with all caches enabled. The M33 architecture itself does not include any cache; however, the MCU used includes a vendor-implemented instruction cache. The target boards were selected to cover the three Cortex-M processors with single instruction-multiple data (SIMD) capabilities. The CNN models could be tested only with the X-CUBE-AI API because the version of TFLM used does not provide an int8 version of the reduce layer.

IV. EXPERIMENTAL RESULTS
A. Models' Accuracy Evaluation
The accuracy of each case was a function only of the model itself, unaffected by either the device or the API used to implement it. In both cases, the ANN architecture provided the better accuracy, most notably for the surgical dataset, exhibiting over a 12-percentage-point difference w.r.t. the CNN, whereas for the UCI models the difference is less than 0.5 percentage points, as shown in Table III. A factor that may contribute to the scale of the changes in accuracy is the difference in input processing between the surgical ANN and CNN models. While the UCI models and the surgical CNN model all use the raw motion data from the datasets as inputs, the surgical ANN model requires preprocessing the raw data using a discrete Fourier transformation (DFT).
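A minimal NumPy sketch of such DFT-based preprocessing is shown below; the window length and the use of magnitude features are hypothetical stand-ins, as the letter does not specify its framing:

```python
import numpy as np

def dft_features(window):
    """Magnitude spectrum of one motion channel via a discrete
    Fourier transform; a hypothetical stand-in for the surgical
    ANN's preprocessing step."""
    spectrum = np.fft.rfft(window)  # one-sided DFT of a real signal
    return np.abs(spectrum)         # keep magnitudes as features

# e.g., a 256-sample channel yields 256//2 + 1 = 129 magnitude values
feats = dft_features(np.ones(256))
```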

B. Models' Performance and Memory Overhead
This section considers only the ANN models, which outperformed the CNN models (see column 4, Table III) owing to their tendency to perform well when trained and used with small amounts of data.
In addition to accuracy, four parameters have been compared: performance (PF), MS, flash saving (FS), and stack saving (SS). Performance is the rate at which a network can be invoked, calculated from the cycles needed to run the network inference. All MCUs run at their maximum clock speeds, so the cycle count can be considered a worst case for these specific MCUs. MS represents the amount of RAM saved, benchmarked against the highest and lowest measured values and calculated from the .data and .bss section sizes reported by the compiler. The FS metric is the ROM counterpart of MS, calculated from the sizes of the .text and .data sections reported by the compiler. SS is derived from the peak stack usage of the model; this is an internal stack, defined at compile time and managed by the API, whose use the system performance template allows us to monitor. SS is given as the average stack use compared to the worst-case scenario and represents the memory overhead of the model itself. Fig. 2 identifies general trends.
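PF thus reduces to a clock-over-cycles ratio; a one-line sketch, where both figures are hypothetical and not taken from the letter's measurements:

```python
def inferences_per_second(clock_hz, cycles_per_inference):
    # Performance (PF): network invocation rate derived from the
    # cycle count of one inference at the MCU's clock frequency.
    return clock_hz / cycles_per_inference

# e.g., a hypothetical MCU at 480 MHz spending 1.2 M cycles per inference
rate = inferences_per_second(480e6, 1_200_000)  # 400 inferences/s
```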
Using these performance metrics, Fig. 3 shows the relative change in each metric for the M4 and M33 compared to the corresponding M7 case. The most apparent change between the different processors is performance: it increases consistently when moving from the M4 to the M7. This pattern is expected given the more complex architecture of the M7, which uses a six-stage superscalar pipeline against the scalar three-stage pipelines in the M4 and M33 and allows for limited instruction-level parallelism. The M33 improves on the M4 by using the newer ARMv8-M architecture, which allows early completion and the dual issue of some instructions.
The other parameters show much smaller improvements but follow the same pattern. Decreases in FLASH size are mostly attributable to the additional instructions available to the M7 and, to a lesser extent, the M33. The reduction in memory use is likely due to these additional instructions reducing the need for intermediate values. Both conclusions are drawn from the binary size values reported by the compiler: the .data size remains within 4 bytes for equivalent cases between processors, indicating that the changes in FLASH and RAM use are driven by changes in the code size and uninitialized variables, respectively.
The breakdown of parameter size by layer explains the CNN's smaller FLASH growth on the UCI dataset relative to the ANN: when the input size of the ANN increases from 126 (surgical) to 561 (UCI), its input parameter size increases by a similar factor (×4.5), whereas for the CNN the increase in parameter size is much smaller (×1.05). This accounts for the smaller increase in the CNN's FLASH requirement over both APIs.
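Assuming the ANN's growth is dominated by its first dense layer's weight matrix, the ×4.5 factor can be roughly reproduced with a quick parameter count (layer widths from Section III-A):

```python
def dense_params(n_in, n_units):
    # Weights plus biases of one fully connected layer.
    return n_in * n_units + n_units

# First dense layer of the ANN (400 neurons) with each dataset's input size
surgical = dense_params(126, 400)  # 50,800 parameters
uci = dense_params(561, 400)       # 224,800 parameters
ratio = uci / surgical             # ~4.4, close to the x4.5 noted above
```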
From Fig. 2, the characteristic most obviously affected by the choice of API is stack usage, which is reduced by around 15% in the X-CUBE-AI implementations. Fig. 2 also shows a similar, but lesser, improvement in performance, while Fig. 4 shows a significant reduction in execution cycles compared to TFLM and a further reduction in FLASH usage. TFLM does show a decrease in RAM usage, but this is marginal. Finally, considering the two datasets used to train the models, the surgical dataset presented here has the advantage in all characteristics for the ANN models; for the CNN implementations, however, the model trained using the UCI dataset performs significantly better.

V. FINAL REMARKS AND CONCLUSION
This letter presents two new low memory footprint models for classifying laparoscopic surgical skill levels at the edge. The ANN trained on the surgical dataset provides a high accuracy (98.89%), comparing favorably with existing models while reducing hardware requirements. The X-CUBE-AI API produced the best overall implementations, although it is restricted to STMicroelectronics devices, and TFLM may still be desirable where this is a concern. The M7 processor was shown to produce the best performance in all characteristics, with the M33 similarly outperforming the M4. These preliminary findings confirm that the analysis of just the upper arms is a viable and accurate alternative for the classification of surgeons' skills where sterility requirements are enforced and edge processing and classification are required.