
DataRecipe

journal contribution
modified on 2024-06-18, 03:07
<h2>This is a public release for a study on data quality in training code LLMs for code generation.</h2><h2><b>Replication Package (datarecipe)</b></h2><p><br></p><ol><li>Basic Setup<br>> Install torch (https://pytorch.org/get-started/locally/) based on your device type, then install the other packages using:<br>>> pip install -e .<br></li><li>Dataset Download<br>> Each dataset must be downloaded to a local directory before it can be loaded.<br><br>The datasets can be found at:<br>MBPP: https://huggingface.co/datasets/google-research-datasets/mbpp<br>CoNaLa: https://huggingface.co/datasets/neulab/conala<br>CodeAlpaca: https://huggingface.co/datasets/antolin/codealpaca-filtered<br>DS-1000: https://huggingface.co/datasets/xlangai/DS-1000<br>ODEX: https://huggingface.co/datasets/loubnabnl/odex-data<br></li><li>Information for Replication</li></ol><ul><li>To replicate <b>fine-tuning</b> on each dataset, we used the following <b>command</b>:<br>CUDA_VISIBLE_DEVICES=GPU_NUM \<br>python finetuning/train.py \<br>--model_ckpt ...../MODEL/PATH \<br>--dataset DATASET_PATH \<br>--max_length 512 \<br>--num_epochs 50 \<br>--optim OPTIMIZER \<br>--learning_rate LEARNING_RATE \<br>--weight_decay WEIGHT_DECAY \<br>--eval_freq EVAL_FREQ \<br>--batch_size BATCH_SIZE \<br>--gradient_accumulation_steps GRADIENT_ACCUMULATION_STEPS<br></li><li>To perform <b>inference</b> with the checkpoints (i.e., the fine-tuned models), we used:<ul><li>the <b>./runs/dataset_name.sh</b> script files.</li><li>Example hyperparameters are provided inside the script files, but they may need to be adjusted when the experimental environment differs.</li></ul></li><li>The source code we primarily used:<ul><li>train.py: the main script for training and validation.</li><li>main.py: the main script for testing.</li><li>utils.py: utility and data-processing code.</li></ul></li></ul><p><br></p><h4><b>Characteristic Analysis</b></h4><h4><b>Setup</b></h4><ol><li>Download the package and uncompress the .zip file.</li><li>Install the libraries.<br>[command] <b>pip install -r requirements.txt</b></li></ol><p dir="ltr">Instructions for each analysis</p><ol><li>Analysis based on conventional NLP techniques<br>[command] <b>python nlpAnalysis.py</b><br>[results] The results will be written to nlpAnalysis_result.csv.</li><li>Code-side issue extraction using SonarQube<br>[instruction]<br>- Install SonarQube<br>- Create a project for each dataset<br>- Use <b>convertFromOriginalToPy.py</b> to convert the original dataset into complete Python files<br>- Upload the input files to SonarQube<br>- Retrieve the SonarQube analysis report<br>[command] <b>python convertFromOriginalToPy.py</b><br>[results] The output files will be saved in the <i>dataset_before_refactoring</i> directory.</li><li>Automated code refactoring<br>[instruction] Each tool can be applied to a selected <i>file</i>.py:<br>[command]<br>1) cp -r dataset_before_refactoring dataset_after_refactoring<br>2) autopep8 --in-place <i>file</i>.py<br>3) black <i>file</i>.py<br>4) yapf -i <i>file</i>.py<br>5) autoflake --in-place --remove-unused-variables <i>file</i>.py<br>6) docformatter --in-place <i>file</i>.py<br>7) unify --in-place <i>file</i>.py<br>[results] The output files will be saved in the <i>dataset_after_refactoring</i> directory.</li><li>Understandability measurement<br>[command] <b>python pylintUnderstandabilityScore.py</b><br>[results] The results will be written to <i>UnderstandabilityBeforeRefactoring.csv</i> and <i>UnderstandabilityAfterRefactoring.csv</i>.</li></ol>
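The "Dataset Download" step above only lists the Hugging Face URLs. As a minimal sketch of fetching them with the Hugging Face `datasets` library (the local directory layout and the `local_dir`/`download_all` helpers are illustrative assumptions, not part of the package):

```python
"""Sketch: save the five datasets to local directories (layout is an assumption)."""
import os

# Hugging Face dataset IDs from the README; local folder names are illustrative.
DATASET_IDS = {
    "mbpp": "google-research-datasets/mbpp",
    "conala": "neulab/conala",
    "codealpaca": "antolin/codealpaca-filtered",
    "ds1000": "xlangai/DS-1000",
    "odex": "loubnabnl/odex-data",
}

def local_dir(name: str, root: str = "datasets") -> str:
    """Return the local directory a dataset is saved to."""
    return os.path.join(root, name)

def download_all(root: str = "datasets") -> None:
    """Fetch each dataset and save it to disk (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the sketch loads without it
    for name, repo_id in DATASET_IDS.items():
        # Some repositories expose multiple configurations; pass a config name
        # to load_dataset if it asks for one.
        ds = load_dataset(repo_id)
        ds.save_to_disk(local_dir(name, root))
```

Calling `download_all()` once leaves each dataset under its own subdirectory, matching the README's note that datasets are loaded from local paths.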
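The seven refactoring commands in the "Automated code refactoring" step are run per file. A minimal sketch that applies them to every .py file in the copied directory (the `refactor` helper is our own assumption; the tools must already be installed, and failures on individual files are deliberately ignored):

```python
"""Sketch: copy dataset_before_refactoring and run each refactoring tool per file."""
import pathlib
import shutil
import subprocess

# Tool invocations from the README; each receives the target file as its last argument.
TOOLS = [
    ["autopep8", "--in-place"],
    ["black"],
    ["yapf", "-i"],
    ["autoflake", "--in-place", "--remove-unused-variables"],
    ["docformatter", "--in-place"],
    ["unify", "--in-place"],
]

def refactor(src: str = "dataset_before_refactoring",
             dst: str = "dataset_after_refactoring") -> None:
    """Copy src to dst, then apply every tool to each .py file under dst."""
    shutil.copytree(src, dst, dirs_exist_ok=True)
    for path in pathlib.Path(dst).rglob("*.py"):
        for tool in TOOLS:
            # check=False: one tool failing on one file should not stop the run.
            subprocess.run([*tool, str(path)], check=False)
```

This mirrors the manual sequence (cp -r, then tools 2-7) while keeping the original files untouched in <i>dataset_before_refactoring</i>.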