Building footprint extraction from very high-resolution satellite images using deep learning

ABSTRACT Building footprint datasets are valuable for a variety of uses in urban settings. Many urban applications require polygonal building outlines with regularised boundaries, which are extremely challenging to prepare. We propose a deep learning strategy based on convolutional neural networks for retrieving building footprints. The model was trained using images from a variety of places across the metropolis, highlighting differences in land use patterns and the built environment. The evaluation measures indicate how accuracy characteristics differ across distinct built-up settings. The results of the model are comparable to state-of-the-art building extraction methods.


Introduction
Building footprint data in an urban setting can be used for a variety of purposes, including urban planning and infrastructure development (Eirinaki et al. 2018). Telecommunication network planning (Haala and Kada 2010) and the siting of towers require knowledge of the density of structures in the jurisdiction. Building datasets are extremely helpful in implementing rooftop solar photovoltaic and rainwater harvesting schemes, especially if the data is available in geospatial format (Wiginton et al. 2010, Mun and Han 2012). Up-to-date databases are useful for a variety of urban projects, including slum redevelopment (Das et al. 2014) and transportation planning. Law enforcement and the detection of illegal construction (Wurm and Taubenböck 2018) necessitate legitimate data validation procedures. Urban datasets are also critical in disaster risk assessment and the development of mitigation plans for disaster-prone locations, where they help prioritise resource distribution and evacuation strategies (Ehrlich et al. 2010).
Rapid infrastructure expansion has occurred in recent decades, necessitating frequent dataset updates (Koc-San and Turker 2012). Remote sensing is the most widely used approach for acquiring data with the advantages of scale and affordability, and it is widely employed in the development of land use maps (Bharath et al. 2017), utility maps, road networks (Senthilnath et al. 2020), and other purposes. The availability of higher-quality satellite images and geospatial data providers is steadily increasing. High-performance computing systems, such as graphical processing units (GPU) and cloud computing, are beneficial in this regard. Over the last three decades, various techniques have been used to study automatic building footprint extraction, including the shadow of buildings on images (Liow and Pavlidis 1990) at a specific sun angle, spatial feature-based metrics (Sirmacek and Unsalan 2008), edge detection (Ferraioli 2009), feature shape (Dunaeva and Kornilov 2017), height dimension (Zhong et al. 2015), and machine learning (Rastogi et al. 2020, Chen et al. 2020b). However, generalising automatic approaches in scenarios where built-up environments vary greatly within a city or locality is challenging.
The utility of remote sensing images has become more diverse due to recent advances in deep learning techniques and increased computational capacity. Building footprint extraction is formulated as a semantic segmentation challenge in the deep learning approach. Semantic segmentation is a method of categorising all pixels in a digital image into a set of categories. Segmentation's main goal is to transform a raw image into a more meaningful representation in a certain context. Semantic segmentation (Längkvist et al. 2016, Fu et al. 2017, Maggiori et al. 2017, Yi et al. 2019) was undertaken on satellite images using deep learning approaches to generate building footprints over a large geographical area. Several other automated algorithms retrieved building attributes from remote sensing images before deep learning-based segmentation. To classify building pixels from remote sensing images, San and Turker (2006), Soumya et al. (2018) and Prakash et al. (2018) employed methods such as maximum likelihood classifier (MLC), random forest (RF), and support vector machine (SVM). Although these strategies enhanced the accuracy of segmentation, they remained insufficient for dealing with complicated metropolitan situations. Li et al. (2018) examined image classification approaches using deep learning to classify multiple land use classes.
Current building extraction models are essentially semantic segmentation of images (Chhor et al. 2017) that employ models established for a range of different applications, such as U-Net for medical image segmentation (Ronneberger et al. 2015) and others. While reconstructing the U-Net model for satellite images for building footprint extraction, we noticed a slew of undesired artefacts and scope for improvement in the model. As a result, we increased the depth of the model and changed numerous parameters at each stage, such as the number of convolutions, input image dimension, normalisation, and so on. The model summary (appendix 1) gives the various parameters used to fine-tune it for remote sensing images. We also demonstrated the model's performance in a variety of prominent built-up environments that were selected through thorough visual observation.

Related works
Researchers from all over the world have utilised VGGNet (Yang et al. 2018), GoogleNet (Ostankovich and Afanasyev 2018), ResNet (Wen et al. 2019), AlexNet (Vakalopoulou et al. 2015), a water body extraction neural network (Chen et al. 2020), an urban green spaces network (Chen et al. 2021) and SegNet (Yang et al. 2018) to segment images in order to extract certain features, with promising outcomes. U-Net (Ronneberger et al. 2015), a convolutional neural network (CNN) based model developed for medical image segmentation, has recently been utilised to extract features from satellite imagery (Chhor et al. 2017, Rastogi et al. 2020, Wang and Li 2020). This model structure can capture the context of spatial features and allows precise localisation since it uses up sampling and down sampling techniques for classification (Guo et al. 2020, Soni et al. 2020). No dense layers are employed in this architecture, which consists entirely of convolutional layers. As a result, the model may be trained with images of various sizes (McGlinchy et al. 2019). For semantic segmentation of satellite imagery into different land use classes, Wu et al. (2019) used a modified U-Net architecture and a transfer learning technique. Xu et al. (2018) created Res-U-Net for building extraction and used guided filtering to increase segmentation performance. They also established a scope to include transfer learning with modified U-Net for building extraction.
The encoder-decoder architecture utilised in CNN-based deep learning architectures for remote sensing image segmentation is designed to achieve the desired objectives. SegNet (Yang et al. 2018), ResNet (Wen et al. 2019), and VGGNet (Yang et al. 2018) are three popular models that have been employed in previous research. Convolution, batch normalisation, and non-linearity are used by the SegNet encoder, which is then followed by max-pooling. SegNet has a smaller number of parameters and consumes less memory. SegNet's main feature is that it avoids the requirement to learn upsampling while preserving high-frequency details. During implementation, the increased number of layers in this model generates a vanishing gradient issue. Researchers at Microsoft Research first proposed ResNet, or Residual Network, to overcome the problem of vanishing gradients. As a result, ResNet is best suited to networks with a large number of layers. The 'skip connection' here bypasses a few stages of training and links directly to the output. The network fits the residual mapping instead of using layers to learn the underlying mapping. Regularisation will skip any layer that degrades the architecture's performance. The VGGNet architecture drastically reduces the number of parameters in the convolution layers and the training time. All the convolution kernels are of size 3 × 3, and max-pool kernels are of size 2 × 2 with a stride of two. With fewer trainable variables, learning was faster and more resistant to overfitting. All of these basic model architectures were proposed for the encoder structure, and the decoder was designed to up sample the input using the indices saved during the encoding stage.
According to careful observation of predicted outputs from prior models, the performance of deep learning models is dependent on the built-up environments to which they are exposed. The goal of this project is to fine-tune the model for an urban scenario with a variety of built-up environments. In the current study, a modified U-Net model is used to extract building footprints. The original U-Net model architecture (Ronneberger et al. 2015), previously designed to segment medical images, has been enhanced to suit very high-resolution satellite images. The model is trained using transfer learning approaches because the research area has a variety of built-up environments. The performance of the trained deep learning model is evaluated in a variety of built-up environments, including commercial, sparse residential, dense residential, industrial, and mixed urban. In addition, standard deep learning architectures such as VGGNet, SegNet, and ResNet, which are utilised for remote sensing image segmentation, are employed to compare the model.

Network architecture
The encoder and decoder parts of the semantic segmentation model are a contracting path and an expanding path, respectively. An encoder utilises a series of convolutions and a downsampling or pooling technique to generate higher-level features with reduced spatial resolution from an image of a particular input size. The decoder's job is the exact opposite of the encoder's: it performs upsampling and convolutions to restore the contracted image to the original image size. The advantage of an encoder-decoder design is that it can cope with features of different sizes and dimensions. Figure 1 depicts the architecture of the proposed model. At each level, the images are downsampled and convolutions are applied, as depicted on the left side of the diagram. The contextual information in the image is extracted and interpreted using this approach. The image is up sampled to its original size on the right side, which aids in feature localisation.
The model built for the current research is divided into the following three parts:
(1) Down sampling layers: made up of five blocks, with each block consisting of
• 3 × 3 convolution layer, ReLU (Rectified Linear Unit) activation function
• 3 × 3 convolution layer, ReLU activation function
• Max pooling, dropout, batch normalisation
The number of feature maps is doubled at the start of each block, from 32 to 512. Down sampling in this architecture captures the contextual information present in the image, which is passed to the expanding path through skip connections.
(2) Bottom layer: two convolution layers lying between the two parts of the network model, with 1024 feature maps.
(3) Up sampling layers: the up sampling path comprises five blocks, each consisting of
• Concatenation layer, batch normalisation
• 3 × 3 convolution layer, ReLU activation function
• 3 × 3 convolution layer, ReLU activation function
• 3 × 3 up convolution or up sampling, dropout
At the end, a 1 × 1 convolutional layer maps the 32 feature maps to a single layer representing the two classes, and a sigmoid activation function is applied. The predicted map consists of probability values ranging from 0 to 1, which need to be converted to a binary map using a carefully chosen threshold. The various parameters of the network are explained in the following subsections.
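The block structure above can be summarised as a shape trace. The following is a minimal sketch (the function name is illustrative) that follows only the spatial size and feature-map count through the network, assuming a 256 × 256 input, five blocks doubling filters from 32 to 512, and a 1024-map bottleneck as stated in the text; it is not the trainable implementation.

```python
def trace_unet_shapes(input_size=256, base_filters=32, depth=5):
    """Trace (stage, spatial_size, feature_maps) through the modified U-Net."""
    shapes = []
    size, filters = input_size, base_filters
    # Contracting path: two 3x3 convs (ReLU), then 2x2 max pooling per block.
    for _ in range(depth):
        shapes.append(("down", size, filters))
        size //= 2          # max pooling halves the spatial dimension
        filters *= 2        # feature maps double each block: 32 -> 512
    # Bottom layer: two convolutions with 1024 feature maps.
    shapes.append(("bottom", size, filters))
    # Expanding path: upsample, concatenate the skip connection, two convs.
    for _ in range(depth):
        size *= 2
        filters //= 2
        shapes.append(("up", size, filters))
    # Final 1x1 convolution with sigmoid to a single probability channel.
    shapes.append(("output", size, 1))
    return shapes
```

Tracing the defaults confirms the bottleneck sits at 8 × 8 with 1024 maps and the output recovers the 256 × 256 input size.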

Activation function
The activation function defines a neuron's output for a given set of input values. The ReLU activation function is used in the proposed model, as indicated in the model architecture. Applied after each convolution layer, ReLU introduces non-linearity. Because the satellite image contains highly heterogeneous features, it is critical to include non-linearity in the model when extracting building footprints. The representation of the ReLU activation function is given in Figure S1.

Loss function
The loss function in the current model is weighted binary cross-entropy. It is a cross-entropy variant in which all positives are weighted by a coefficient β, which helps minimise the bias between the two classes in a dataset and improves the loss function's ability to escape local minima. The sigmoid activation function at the last layer predicts probability values between 0 and 1. For a ground-truth label y and predicted probability p, the binary cross-entropy is given by

L = −[y log(p) + (1 − y) log(1 − p)]

and the weighted binary cross-entropy by

L_w = −[β y log(p) + (1 − y) log(1 − p)]

Set β > 1 to decrease the number of false negatives, and β < 1 to decrease the number of false positives. The value of β can be adjusted on a case-by-case basis according to the model's performance.
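A minimal NumPy sketch of the weighted loss described above; the function name and the clipping constant `eps` (used to avoid log(0)) are illustrative and not from the paper.

```python
import numpy as np

def weighted_bce(y_true, y_pred, beta=1.0, eps=1e-7):
    """Weighted binary cross-entropy.

    beta > 1 weights positives up (fewer false negatives);
    beta < 1 weights them down (fewer false positives).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    loss = -(beta * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()
```

With beta = 1 this reduces to ordinary binary cross-entropy; a framework implementation would operate on tensors, but the arithmetic is the same.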

Data augmentation
Data augmentation is a technique for increasing the diversity of data that may be utilised to efficiently train machine learning models. Cropping, flipping, padding, rotation, and scaling are examples of image modification techniques that serve to enhance the number of different datasets available for training models. Data augmentation was utilised to train the model by Marmanis et al. (2018) and Ji et al. (2019). When there aren't enough training examples available for image segmentation, data augmentation is employed as a supplementary dataset. It aids in infusing the required invariance into the model during training. However, the experiments in the current research work suggest that spatial datasets are an exception: data augmentation adversely affected the model during training. This could be attributed to possible disjoints created at the corners of every tile (64 × 64, 96 × 96, 128 × 128 and 256 × 256). Instead of improving the model's accuracy, data augmentation operations in this scenario began unlearning the spatial context, resulting in a decrease in accuracy parameters.

Hyper parameters
The hyper parameters of the model had to be carefully tuned to suit certain visual characteristics. Learning rate, drop rate, batch size, image dimension, and number of filters are some of these parameters. The model takes longer to minimise the loss function if the learning rate is low, and if it is large, the step size may miss the minimum points. To ensure a better model training process, the Keras library's learning rate finder function is used to determine the appropriate learning rate range. The model loss versus learning rate curve was plotted (Figure S2); the values in the range where the curve drops steeply are considered for the model. The drop rate is set to 0.2, which means that 20% of random neurons are removed from the network after each max-pooling layer for improved model performance. During model training, a batch size of 16 images with an image dimension of 256 × 256 pixels is pulled for each epoch. As indicated in the model representation in Figure 1, the number of filters or data layers varies across the model. Within the contracting path, filters are doubled in each layer, whereas in the expanding path, they are halved.
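The learning-rate range test mentioned above can be sketched as follows. The exponential sweep is the core idea behind such finders: one batch is trained at each rate, the loss is recorded, and the useful range is where the loss-vs-rate curve drops steeply. The actual per-batch training loop and the exact routine used in the study are omitted; the function name and defaults are assumptions.

```python
import numpy as np

def lr_schedule(min_lr=1e-6, max_lr=1e-1, steps=100):
    """Exponentially increasing learning rates for a range test.

    Returns `steps` rates spaced evenly on a log scale from
    min_lr to max_lr, inclusive.
    """
    return min_lr * (max_lr / min_lr) ** (np.arange(steps) / (steps - 1))
```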

Data normalisation
The goal of normalising the image dataset is to convert a pixel's digital number (DN) into a standard scale. This is accomplished without distorting the relative differences in actual DN values. By reducing the dataset's imbalances, this technique makes learning faster and more efficient during training. The normalising formula is as follows:

DN_norm = (DN − DN_min) / (DN_max − DN_min)
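The min-max formula above can be implemented directly. This sketch assumes the study's 10-bit radiometry, so the default range is DN_min = 0 and DN_max = 1023; the function name is illustrative.

```python
import numpy as np

def min_max_normalise(dn, dn_min=0.0, dn_max=1023.0):
    """Min-max normalise digital numbers to the [0, 1] range.

    Using the sensor's radiometric range rather than per-tile
    extremes keeps all tiles on a common scale.
    """
    return (dn.astype(np.float32) - dn_min) / (dn_max - dn_min)
```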

Transfer learning
The study uses the transfer learning technique to train the model for a variety of built environments, including commercial, sparse residential, dense residential, industrial, and mixed urban. The training datasets are collected from various parts of the city as rectangular images measuring 5 km × 5 km, and the model is trained. The model is trained using a transfer learning strategy to get the benefits of data diversity from multiple built-up conditions. The trained model's weights are loaded, and subsequent training is carried out by adjusting the weights. Transfer learning allows a single model to be trained for a variety of built environments, allowing the model to be generalised.
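The staged fine-tuning described above can be sketched generically: each training stage starts from the weights produced by the previous one. Here `train_fn` is a placeholder for the actual Keras training call (load weights, fit, return weights) and is not from the paper.

```python
def staged_training(stages, train_fn, weights=None):
    """Sequentially fine-tune one model over several built-up environments.

    `stages` is a list of datasets (one per environment); `train_fn`
    trains the model on a dataset starting from `weights` and returns
    the updated weights.
    """
    for data in stages:
        # Later stages start from the weights of earlier stages,
        # so diversity accumulates in a single model.
        weights = train_fn(data, weights)
    return weights
```

A toy `train_fn` that merely logs the stages it saw makes the ordering explicit.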

Evaluation metrics
The procedure of extracting buildings from satellite images involves segmenting the image into two classes. As a result, it is categorised as a classification problem, and the quality of the results is assessed using a confusion matrix (Demir et al. 2018, Shrestha and Vanneschi 2018). It aids in the generation of a variety of accuracy parameters, including classification accuracy, precision, recall, Intersection over Union (IoU), and many others. Classification accuracy is the ratio of correctly classified pixels to all classified pixels. Precision is defined as the proportion of positive predictions that are correct. Recall measures the proportion of actual positives that the classifier correctly identifies. IoU measures the overlap between existing and predicted building areas. Table 1 contains the accuracy parameter equations.
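The metrics in Table 1 can be computed from binary masks as follows; this is a straightforward implementation of the standard confusion-matrix formulas, and the function name is illustrative.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """Accuracy, precision, recall and IoU from binary masks.

    y_true / y_pred are {0, 1} arrays of the same shape; 1 = building.
    """
    tp = np.sum((y_true == 1) & (y_pred == 1))  # building predicted as building
    tn = np.sum((y_true == 0) & (y_pred == 0))  # background kept as background
    fp = np.sum((y_true == 0) & (y_pred == 1))  # background flagged as building
    fn = np.sum((y_true == 1) & (y_pred == 0))  # building missed
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }
```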

Data and study area
A panchromatic band and four multispectral bands (red, green, blue, and infrared) make up the high-resolution satellite image (TripleSat). The spatial resolution of the panchromatic band is 80 centimetres, and the spatial resolution of the multispectral bands is 3.6 metres. The image is made up of 10 bits of radiometric data. Building masks are created by manually digitising the buildings as vector polygons and then converting them to a raster format. Figure S3 depicts a subset of the image and masks. The image is divided into 256 × 256-pixel grids to train the deep neural network model. In the current work, satellite images from Bangalore city, Karnataka state, India, are used to depict various built-up environments. Bangalore city is chosen as a study area for two major reasons. The first is that Bangalore has drawn a lot of interest and traction in real estate development over the past 30 years and is expected to continue the same trend; the statistical information derived from the study would have higher significance for future work, as there are already plenty of remote sensing-based studies available in the literature. Secondly, deep learning models are not widely applied to Indian cities, as these cities do not follow the characteristics of planned cities, which are more structured and homogeneous. The city has all types of built-up conditions, from planned residential suburbs to highly populated slums. The city's population exceeds 10 million, and as per government records, more than 2.5 million tax-paying properties are present in the city. Figure 2 depicts the entire procedure, which is separated into four steps. The relevant dataset is prepared for direct usage in the model in the first stage by conducting normalisation. The second stage involves training a CNN-based deep learning model with images and building masks using a transfer learning technique.
The trained model is then applied to an unknown location with varying built-up environments for building extraction in the third stage. Using ground truth or building masks, the final procedure involves calculating evaluation metrics.

Image preprocessing
The satellite image obtained with the TripleSat sensor from India's National Remote Sensing Centre (NRSC) comprises two different datasets, the first of which is a panchromatic band with better spatial resolution and the second of which is a multispectral band with lower spatial resolution. To obtain the benefits of increased spatial and spectral resolution, the multispectral image is pan sharpened with the panchromatic band using IHS transformation, selected on the basis of a higher universal image quality index (UIQI) value. As a result, the final image comprises four bands from a multispectral image with improved spatial resolution. The buildings were digitised by hand, and building masks were created in raster format with metadata values similar to those of the source images. The image and masks are separated into 256 × 256 tiles, yielding approximately 1000 image tiles each. Before the training, these image tiles are normalised.
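The tiling step can be sketched as below. This minimal version drops incomplete edge tiles, which is one of several possible border policies (the paper does not state which was used); padding is an alternative.

```python
import numpy as np

def tile_image(image, tile=256):
    """Split a (H, W, bands) array into non-overlapping tile x tile grids.

    Edge pixels that do not fill a complete tile are dropped here
    for simplicity.
    """
    h, w = image.shape[:2]
    tiles = []
    for r in range(0, h - tile + 1, tile):
        for c in range(0, w - tile + 1, tile):
            tiles.append(image[r:r + tile, c:c + tile])
    return tiles
```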

Model training
In terms of depth and hyper parameters, the modified U-Net model offers a novel technique for improving feature extraction from a number of built-up environments. The current model differs from the original U-Net model (Ronneberger et al. 2015) in its depth, or number of max-pooling operations. Eighty percent of the labelled datasets are utilised to train the model, and the remaining 20% are used to validate the model. While training, the accuracy of training and validation data is monitored (Figure S4). The model training's performance is evaluated for a variety of tile sizes, including 64 × 64, 96 × 96, 128 × 128 and 256 × 256. This allows batch sizes to be varied from 16 to 64 depending on available memory. Figure 3 shows the improvements in prediction with the inclusion of a variety of data from different built-up conditions. Training of models is carried out in the transfer learning method by using the weights of the previous training stage.

Postprocessing
Predictions are delivered as a single band image with the same dimension as the input shape (256 × 256), which must be stitched together to make the complete image. The predicted output image is assigned the coordinate reference system (CRS) and the datum of the original image. The model's prediction ranges from 0 to 1 as probability values, where '0' and '1' denote the background and building classes, respectively. Two methods were used to convert the probabilistic output into a binary class: (i) thresholding, in which a value of 0.5 resulted in the best accuracy, and (ii) Conditional Random Field (CRF) prediction. When it came to generating binary image outputs, the thresholding strategy outperformed the CRF method. A noise removal technique is used to remove any artefacts in the output. Morphological closing and opening operators are employed to increase the accuracy of predictions or to bring them closer to reality, but they often have the unintended consequence of lowering the output quality in terms of assessment metrics.
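The thresholding and noise-removal steps can be sketched with SciPy's morphology operators. The 3 × 3 structuring element is an assumption, as the paper does not specify the kernel size, and the function name is illustrative.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, threshold=0.5, structure_size=3):
    """Threshold the probability map and remove small artefacts.

    0.5 gave the best accuracy in the study; morphological opening
    removes isolated noise pixels and closing fills small holes.
    """
    binary = prob_map >= threshold
    structure = np.ones((structure_size, structure_size), dtype=bool)
    binary = ndimage.binary_opening(binary, structure=structure)
    binary = ndimage.binary_closing(binary, structure=structure)
    return binary.astype(np.uint8)
```

As the text notes, these operators can also erode genuine detail, so their effect on the assessment metrics should be checked rather than assumed.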

Results and discussion
A CNN-based deep learning architecture is developed in this paper to extract building footprints from high-resolution satellite imagery. An 80-centimetre spatial resolution satellite image with RGB bands was used in the experiment. Image grids and binary building masks are used to train the proposed model. The model is trained using the transfer learning technique for a variety of built-up situations, with 579 image tiles of dimension 256 × 256 for training and 167 images in the validation dataset. As the models are exposed to similar built-up environments, their ability to predict different types of buildings improves. The proof can be seen in Figure 3, where the model's prediction for a certain type of building improves with each training stage. Visual inspection of the model's output reveals that it is capable of generating building footprints that correspond to ground reality. Figure 4 shows the image and the corresponding ground truth or building mask, as well as the predictions obtained from various models (additional images are given in Figure S5). Figure 4 shows a series of images with identical dimensions that cover one square kilometre. A close examination of the results indicates that roads are classified as buildings in a few spots, resulting in small undesirable patches. Several buildings with irregular shapes are also poorly captured, and the model is unable to discriminate between structures in close proximity. This reveals a weakness in the model's ability to separate small features found in satellite images. Because images are split into 256 × 256 grids during prediction, disjointed predictions can be observed along the margins between two grids. This might result in irregular building polygons during vectorisation, as well as the creation of holes in some circumstances. During the training stage, the errors at these edges were observed to increase as the grid size was reduced to 64 × 64, 96 × 96 and 128 × 128. The opposite tendency might hold at larger tile sizes, which could be assessed with more computing power. This model is designed for images with a rectangular shape.
The potential for transfer learning is realised by training the models on different urban built-up conditions within the chosen city. The model is then tested on built-up settings such as residential (sparse and dense), industrial, commercial, and mixed urban environments. None of these places were presented to the model during the training process. Testing the model on various built-up environments aided a systematic understanding of the prediction pattern. The inspection clearly demonstrated finer predictions in the residential area, whereas the commercial and industrial areas had fewer, more distinct buildings. Looking back at the training data, it mostly consisted of residential structures with shapes, textures, and colours similar to the residential building area selected. In the industrial built-up class, even though the roofs are large and separated by significant spacing, the model does not capture them correctly because their colours are absent from the training data. In the current experiment, the modified U-Net outperformed previously developed models such as ResNet, SegNet, and the original U-Net. ResNet's prediction was more accurate than the other two models but underestimated the roof area, whilst VGGNet over-predicted the building area.
The performance of the models is evaluated using parameters such as accuracy, precision, recall, and IoU (Table 2). According to the evaluation metrics, an overall accuracy of 0.95 is achieved in segmenting an image in a sparse residential built-up condition using the modified U-Net. When compared to the ground truth, this means that the model correctly classified 95% of the pixels in the image. An IoU of 0.94 in the sparse residential category indicates a 94% overlap between ground truth and the segmented map. Table 2 shows that the developed model could segment all built-up environments with a minimum of 90% accuracy. The comparison of the four models shows that the modified U-Net outperformed the other three models across all accuracy parameters in sparse residential built-up conditions (Figure S6) and in various built-up conditions (Figure 5). The models are developed in the Python environment using the Anaconda open-source distribution. The proposed method is simple yet robust enough to perform semantic segmentation of remote sensing images of any size, whether 5000 × 5000 pixels, 25,000 × 25,000 pixels, or any other dimension. The developed model takes square tiles (64 × 64, 128 × 128, 256 × 256 or larger) from the full image and trains on them. Prediction works in a similar manner, except that the outputs are merged back to the original image size, say 5000 × 5000. Once the predicted image is formed, it can be viewed and manipulated directly in any GIS tool such as QGIS. In terms of architecture, the model preserves the rectangular shape of buildings in the prediction, which is of great significance when performing vectorisation to obtain roof areas closer to true values.
The proposed method also analyses the performance of the segmentation algorithm over different built-up conditions, revealing insights that would help in further data creation and model improvement.
The model is built on satellite images from one source and might not perform in the same manner with other sources; it would require retraining using transfer learning for satisfactory results. Nevertheless, the method would be the same as that followed in this case. The data creation is tedious and time consuming, which makes it difficult to apply to entirely new sets of images. However, it is possible to avoid the data preparation task by making use of existing labelled data such as the Inria aerial image dataset or SpaceNet. Another alternative would be to take cities in which all the buildings are mapped on OpenStreetMap and use them to create building masks, although this route requires a certain amount of data cleaning and alteration. Another limitation is general to all deep learning models involving images: processing power or computation. The model requires at least 6 GB of dedicated graphics card (GPU) memory and 16 GB of RAM to run effortlessly.

Conclusions
A deep convolutional neural network model was built in this study to extract building features from high-resolution satellite images. A manual method of careful digitisation is used to prepare the dataset required for training the neural network model. The model is then trained and its performance evaluated using the transfer learning technique. The model's architecture, which includes a contracting and an expanding path, ensures that it captures localisation and contextual information. The model's ability to extract building features differentiates it from other neural network architectures. The model's output image or prediction map has the same size, shape, coordinates, and datum as the input image, which is useful for further spatial analysis of the results. This ensures that the model can be applied to images from a variety of sensors as well as any geographical location. The model's segmentation output clearly shows that the majority of the buildings are extracted appropriately from high-resolution satellite images. For a clear understanding of output quality, the model's ability to predict in various built-up conditions is evaluated both visually and using the confusion matrix method. The experiment demonstrates that a model that predicts well in one area will not necessarily replicate the same results in different built-up environments. The visual observation suggests that the modified U-Net architecture could produce better building footprint predictions and evaluation metrics than other popular architectures by exhibiting higher accuracy parameters. This model would aid in the generation of building footprint datasets required for a variety of applications. The developed model is most useful in the change detection task because it performs exceptionally well in scenes where it has previously been trained. The study makes a successful attempt to demonstrate the use of deep learning methods to gain insights into complex urban conditions.
For all practical purposes, obtaining 8-bit very high-resolution aerial imagery is unlikely, and reliance on satellite sources is the more economical alternative for quickly generating building or other city-level datasets.
Artefacts occurring in places such as roads and among closely spaced or tiny buildings must be addressed in order to improve model performance. In the future, the neural network model's training can be enhanced with additional building annotations and corresponding images from a larger city area. The model's performance can be evaluated by extending the training process in terms of image tile dimension and batch size. The model is computationally intensive and tailored to specific satellite sensor images; to replicate it at very high resolution from an aerial source, the hyper parameters would need to be tweaked. The model is designed in such a way that it can be applied to images from various satellite sources by changing the hyper parameters without altering the core architecture. Following that, the model's architecture for semantic segmentation of satellite imagery into multiple classes must be tested, as well as the model's behaviour in the extraction of linear features such as roads and water bodies. The model developed in this work could be tested for performance in various urban environments across the country or continents.