Traditional 3D geographic scene reconstruction methods rely on a single data source for geometric restoration and struggle to integrate non-geometric information such as text and semantics. This severely limits reconstruction feasibility when data are sparse or only non-geometric descriptions are available. Advances in generative artificial intelligence (AI) for multi-modal data understanding and generation offer a new path forward. However, applying generative AI to this task still faces two core challenges: the conflict between the randomness of model outputs and the accuracy required for reconstruction, and the conflict between model generality and the need to generate diverse, domain-specific content within geographic scenes. To address these challenges, this study proposes a framework guided by geographic entity semantics and powered by multi-modal generative models. The framework first fuses multi-modal inputs for semantic enhancement, then generates 3D entities under geometric and domain-enhancement constraints, and finally assembles the scene under spatial pose constraints. A case study demonstrates the method's versatility across data combinations: it reconstructs scenes from text-only input, addressing the feasibility challenge, and generates high-precision domain-specific entities through rapid domain enhancement. Adding geometric constraints further yields significant gains in reconstruction accuracy and visual realism, and user evaluations confirm its overall superiority over traditional parametric methods.
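
To make the three-stage pipeline concrete, the following is a minimal Python sketch of its control flow: semantic enhancement, constrained entity generation, and pose-constrained scene integration. All class and function names (`EntitySemantics`, `semantic_enhancement`, `generate_entity`, `integrate_scene`) are illustrative assumptions, not the authors' API, and trivial placeholder logic stands in for the multi-modal generative models.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of the framework's three stages; names and logic
# are assumptions, not the paper's actual implementation.

@dataclass
class EntitySemantics:
    """Semantically enhanced description of one geographic entity."""
    name: str
    category: str                       # e.g. "building", "bridge"
    attributes: dict                    # fused from text, images, domain knowledge
    pose: Optional[tuple] = None        # spatial pose constraint (x, y, heading)

def semantic_enhancement(text: str) -> list[EntitySemantics]:
    """Stage 1: fuse multi-modal inputs into per-entity semantics.
    A trivial keyword pass stands in for a multi-modal generative model."""
    entities = []
    for token in ("building", "bridge", "tower"):
        if token in text.lower():
            entities.append(EntitySemantics(name=token, category=token,
                                            attributes={"source": "text"}))
    return entities

def generate_entity(sem: EntitySemantics, geometric_constraints: dict) -> dict:
    """Stage 2: generate a 3D entity under geometric and domain-enhancement
    constraints. A real system would invoke a 3D generative model here."""
    return {"semantics": sem,
            "mesh": f"<mesh for {sem.name}>",      # placeholder geometry
            "constraints": geometric_constraints}

def integrate_scene(entities: list[dict]) -> list[dict]:
    """Stage 3: place generated entities using spatial pose constraints;
    entities without an explicit pose get a simple fallback layout."""
    for i, entity in enumerate(entities):
        entity["placed_at"] = entity["semantics"].pose or (i * 10.0, 0.0, 0.0)
    return entities

if __name__ == "__main__":
    sems = semantic_enhancement("A riverside scene with a bridge and a tower.")
    scene = integrate_scene([generate_entity(s, {"footprint_m": 20}) for s in sems])
    for entity in scene:
        print(entity["semantics"].name, "->", entity["placed_at"])
```

The separation into three functions mirrors the framework's design: semantic enhancement resolves the feasibility problem for sparse or text-only inputs, while the constraint arguments in the later stages are where reconstruction accuracy is enforced against the randomness of generative outputs.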