DemoCraft: Using In-Context Learning to Improve Code Generation in LLMs

Kapu, Nirmal Joshua; Sreejith, Mihit

doi:10.6084/m9.figshare.27310776.v1

poster

posted on 2024-10-27, 04:13 authored by Nirmal Joshua Kapu, Mihit Sreejith

Generating executable code from natural language instructions using Large Language Models (LLMs) poses challenges such as semantic ambiguity and the need to understand task-specific context. To address these issues, we propose a system called DemoCraft, which enhances code generation by leveraging in-context learning and demonstration selection, combined with latent concept learning. Latent concept learning introduces additional concept tokens, which are trainable embeddings that capture task-specific knowledge. These tokens are integrated into the model’s input space, enabling the model to identify and select optimal demonstrations for a given task. This approach is grounded in the principles of latent variable models, where task-specific latent parameters d encapsulate complex contextual information. The concept tokens refine the model’s prediction process, ensuring that task-specific knowledge is applied during code generation. This study evaluates the impact of these techniques on code generation using the SantaCoder model, tested on the MBPP and HumanEval datasets. Our methodology is structured into four phases: latent concept learning, demonstration selection, output formatting, and code evaluation. Demonstration selection, a critical step, optimizes the model’s generalization capabilities by identifying the examples from which the task concept can best be inferred. We compare two selection methods: latent concept selection, where demonstrations are chosen based on learned embeddings, and random selection. We also ensure that the model’s outputs conform to syntactic and semantic correctness through output formatting procedures. The generated code is evaluated using metrics such as correctness@k, similarity@k, and pass@k.
Our experiments demonstrate a nearly 2x improvement across these metrics, underscoring the role of latent concept learning and demonstration selection in improving the efficiency, accuracy, and adaptability of the SantaCoder model in real-world code generation tasks.
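
Of the three metrics named above, pass@k has a widely used definition outside this poster, so a small sketch may help readers interpret the reported numbers. The abstract does not say how pass@k was computed here; the code below assumes the unbiased estimator commonly used with HumanEval, where n samples are generated per problem and c of them pass all unit tests. The function name and the sample counts are illustrative, not taken from the poster.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of code samples generated for the problem
    c: number of those samples that pass every unit test
    k: evaluation budget (how many samples the user may try)

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the poster does not confirm
    which estimator DemoCraft's evaluation used.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems: (samples generated, samples passing).
    hypothetical = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(hypothetical, k=1))
```

Averaging the per-problem estimate over a benchmark gives the dataset-level score, which is the form in which pass@k is usually reported for MBPP and HumanEval.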

Keywords

large language models, in-context learning, machine learning, latent concept learning, nl2code, code generation

Licence

CC BY 4.0
