figshare

CMAB-The World's First National-Scale Multi-Attribute Building Dataset

Version 7 2025-04-20, 01:20
Version 6 2025-04-03, 04:09
Version 5 2025-03-21, 09:17
Version 4 2025-03-21, 05:54
Version 3 2025-03-21, 04:53
Version 2 2025-02-21, 07:24
Version 1 2024-12-26, 03:58
dataset
posted on 2025-04-20, 01:20 authored by Yecheng Zhang, Huimin Zhao, Ying Long

Rapidly acquiring three-dimensional (3D) building data, including geometric attributes such as rooftop, height, and orientation, as well as indicative attributes such as function, quality, and age, is essential for accurate urban analysis, simulation, and policy updates. Current building datasets cover these attributes incompletely. This paper presents the first national-scale Multi-Attribute Building dataset (CMAB), produced with artificial intelligence, covering 3,667 spatial cities, 31 million buildings, and 23.6 billion m² of rooftops, extracted with OCRNet at an F1-score of 89.93%, and totaling 363 billion m³ of building stock. We trained bootstrap-aggregated XGBoost models with city administrative classifications, incorporating morphology, location, and function features. Using multi-source data, including billions of remote sensing images and 60 million street view images (SVIs), we generated rooftop, height, structure, function, style, age, and quality attributes for each building with machine learning and large multimodal models. Accuracy was validated against model benchmarks, existing similar products, and manual SVI checks, and is mostly above 80%. Our dataset and results are crucial for the global Sustainable Development Goals (SDGs) and urban planning.
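For readers unfamiliar with the metric, the F1-score quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of that calculation from raw counts; the function name `f1_score` and the example counts are illustrative assumptions, not values or code from the CMAB evaluation.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Return the harmonic mean of precision and recall from raw counts."""
    if true_pos == 0:
        # With no true positives, precision and recall are both zero (or undefined).
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Hypothetical rooftop-pixel counts, for illustration only.
score = f1_score(true_pos=900, false_pos=100, false_neg=100)
print(f"F1 = {score:.2%}")
```

Because it is a harmonic mean, the score is pulled toward the lower of precision and recall, so a high F1 requires both few false detections and few missed rooftops.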

Data records: A building dataset with a total rooftop area of 23.6 billion square meters across 3,667 natural cities in China, including the rooftop, height, structure, function, age, style, and quality attributes of each building, as well as the code files used to compute these data. The models used are OCRNet, XGBoost, fine-tuned CLIP, and YOLOv8.
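As a rough illustration of how per-building records like these could be aggregated into totals such as rooftop area and building stock volume, here is a minimal sketch. The `Building` class and its field names are hypothetical and do not reflect the actual schema of the released files, and approximating volume as rooftop area times height is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Building:
    # Hypothetical field names; check the released files for the real schema.
    rooftop_area_m2: float
    height_m: float
    function: str


def total_rooftop_area(buildings: Iterable[Building]) -> float:
    """Sum rooftop area over all buildings, in square meters."""
    return sum(b.rooftop_area_m2 for b in buildings)


def total_volume(buildings: Iterable[Building]) -> float:
    """Approximate stock volume as rooftop area times height (an assumption)."""
    return sum(b.rooftop_area_m2 * b.height_m for b in buildings)


# Made-up sample records, for illustration only.
sample = [
    Building(rooftop_area_m2=100.0, height_m=10.0, function="residential"),
    Building(rooftop_area_m2=200.0, height_m=20.0, function="commercial"),
]
print(total_rooftop_area(sample), total_volume(sample))
```

The same pattern extends to grouping by attribute, for example summing volume per building function.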

Supplementary note: The architectural structure, style, and quality attributes are affected by the temporal and spatial distribution of street views in China. Regarding the recognition of building colors, we found that existing CLIP-series models cannot accurately judge the composition and proportion of building colors; these will instead be calculated and supplemented using semantic segmentation and image processing. Please contact zhangyec23@mails.tsinghua.edu.cn or ylong@tsinghua.edu.cn if you have any technical problems.

Reference Format: Zhang, Y., Zhao, H. & Long, Y. CMAB: A Multi-Attribute Building Dataset of China. Sci Data 12, 430 (2025). https://doi.org/10.1038/s41597-025-04730-5.

Funding

62394331 | National Natural Science Foundation of China

62394335 | National Natural Science Foundation of China

52178044 | National Natural Science Foundation of China

G-2306-34815 | Energy Foundation
