Toward Accurate and Practical AI Assistants in Software Development
Software development is a complex and essential process that underpins nearly every facet of modern life, from everyday applications to critical systems in healthcare, finance, transportation, and more. To support this process, artificial intelligence (AI)-powered assistants have emerged as valuable tools, offering help with tasks such as code generation, program repair, and reverse engineering. While existing AI assistants show initial promise, they often struggle to provide accurate support to developers, frequently producing syntactically invalid or semantically incorrect code.
This dissertation proposes a solution to improve the accuracy, reliability, and practicality of AI assistants in supporting software development tasks: enhancing AI techniques with software domain knowledge. Software domain knowledge encompasses a wide range of aspects, including programming language syntax, project-specific context, and developers' working practices. Four general strategies are explored for injecting domain knowledge into AI models: (1) training on large-scale datasets that implicitly encode domain patterns, (2) designing model architectures that mirror programming language structure, (3) developing learning objectives that enforce syntax and semantics, and (4) applying domain-guided constraints during inference to steer model behavior.
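To make strategy (4) concrete, the following is an illustrative sketch (not taken from the dissertation) of domain-guided constraints applied at inference time: a toy greedy decoder masks out any candidate token that would violate a simple syntactic rule, here parenthesis balance, so the model can only emit strings that respect the constraint. The vocabulary, the stand-in scoring function `toy_logits`, and the helper `violates_syntax` are hypothetical placeholders for a real language model and grammar checker.

```python
# Illustrative sketch: domain-guided constrained decoding over a toy vocabulary.
import math
import random

VOCAB = ["(", ")", "x", "+", "<eos>"]

def toy_logits(prefix):
    """Stand-in for a neural model: deterministic pseudo-random token scores."""
    random.seed(len(prefix))
    return [random.uniform(-1, 1) for _ in VOCAB]

def violates_syntax(prefix, token):
    """Reject tokens that would break parenthesis balance."""
    depth = prefix.count("(") - prefix.count(")")
    if token == ")" and depth == 0:
        return True          # nothing to close
    if token == "<eos>" and depth != 0:
        return True          # would end with an unclosed parenthesis
    return False

def constrained_greedy_decode(max_len=10):
    prefix = []
    for _ in range(max_len):
        scores = toy_logits(prefix)
        # Mask constraint-violating tokens before taking the argmax.
        masked = [s if not violates_syntax(prefix, t) else -math.inf
                  for t, s in zip(VOCAB, scores)]
        token = VOCAB[masked.index(max(masked))]
        if token == "<eos>":
            break
        prefix.append(token)
    return "".join(prefix)

if __name__ == "__main__":
    print(constrained_greedy_decode())  # always parenthesis-balanced output
```

The same masking idea generalizes to real grammars and semantic checks; the model's scores are untouched, and only the set of admissible next tokens is restricted.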
This dissertation presents five research contributions across three critical tasks--program repair, code generation, and reverse engineering--to demonstrate the effectiveness of these strategies. The first two contributions, CURE and KNOD, introduce novel neural architectures that incorporate syntactic and semantic constraints through pre-training and tree-based decoding to reduce invalid code generation in automated program repair. A large-scale empirical study further quantifies the gains that large language models (LLMs) achieve in fixing bugs when they leverage pre-trained knowledge and domain-specific fine-tuning. In code generation, the LeDex pipeline enables LLMs to self-debug their own outputs through a combination of synthetic data, fine-tuning, and reinforcement learning, improving their ability to iteratively refine and correct code. Finally, Nova advances AI understanding of low-level binary code by introducing a hierarchical attention mechanism and a contrastive learning objective to overcome the challenges posed by assembly code's sparsity and compiler variability.
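As a rough illustration of the kind of contrastive objective mentioned above, the sketch below shows a generic InfoNCE-style loss that pulls together embeddings of matched code pairs (for example, the same function compiled under different optimization levels) and pushes apart mismatched pairs. This is an assumed, simplified formulation for exposition, not Nova's actual training code; the random tensors stand in for encoder outputs.

```python
# Minimal sketch of a generic InfoNCE-style contrastive objective.
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, temperature=0.07):
    """anchor, positive: (batch, dim) embeddings of matched code pairs,
    e.g. the same function compiled at -O0 and -O3."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (batch, batch) similarities
    labels = torch.arange(anchor.size(0))         # diagonal entries are positives
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings:
a = torch.randn(8, 256)
p = a + 0.1 * torch.randn(8, 256)                 # slightly perturbed "positives"
print(contrastive_loss(a, p).item())
```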
Together, these works show that incorporating domain knowledge can significantly improve the trustworthiness, correctness, and applicability of AI assistants in real-world development settings. This dissertation lays a foundation for building intelligent, domain-aware development tools and outlines key research directions--including repository-level reasoning, unified lifecycle assistance, and efficient model deployment--for the next generation of AI-powered software engineering.
Degree Type
- Doctor of Philosophy
Department
- Computer Science
Campus location
- West Lafayette