This repository contains resources referenced in the paper: A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages.
- [2025] Survey paper accepted at ACM Transactions on Software Engineering and Methodology (TOSEM)
- [2024] Survey paper submitted: "A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages"
- [2024] Comprehensive analysis of 111 papers covering 40+ programming languages
Large Language Models (LLMs) have shown remarkable capabilities in code generation for popular programming languages. However, their performance in Low-Resource Programming Languages (LRPLs) and Domain-Specific Languages (DSLs) remains a critical challenge.
Our survey provides a systematic review of 111 papers, filtered from over 27,000 studies published between 2020 and 2024, investigating the capabilities and limitations of LLMs in these specialized domains. We identify four main evaluation techniques, categorize enhancement methods into six groups, and analyze dataset curation approaches for LRPLs and DSLs.
Figure 1: Performance comparison between High-Resource Programming Languages (HRPLs) and Low-Resource Programming Languages (LRPLs) on the MultiPL-E benchmark. The heatmap shows significant performance disparities, for example between Python and Rust.
Our systematic investigation addresses three key research questions:
RQ1: Which LLMs, metrics, and benchmarks are used to evaluate code generation in LRPL and DSL domains?
LLMs have been widely used for code generation with many models appearing on leaderboards, but most research focuses on popular languages dominated by Python. It is not clear which LLMs have been used for LRPLs and DSLs, and what evaluation metrics and benchmark datasets are used for these languages. Understanding this information helps assess LLM capabilities and identify whether new metrics are needed for LRPLs and DSLs.
RQ2: What strategies and methodologies have been proposed in the literature to enhance the performance of LLMs for code generation in LRPLs and DSLs?
Enhancing LLM performance in LRPL and DSL settings is essential for bridging capability gaps that prevent significant developer populations from leveraging AI-assisted coding tools. Specialized domains and resource-constrained languages present unique challenges, such as limited training data and highly specialized syntax, which general-purpose LLMs may not effectively address. Understanding proposed strategies and methodologies is crucial for identifying effective approaches and guiding future research.
RQ3: How are datasets for LRPLs and DSLs collected, processed, and utilized to support code generation tasks using LLMs?
High-quality datasets are fundamental to training LLMs for code generation tasks. However, LRPLs and DSLs often suffer from data scarcity and imbalance, which can significantly hinder LLM performance. Understanding the methodologies employed to collect, process, and utilize datasets for these specialized languages is critical for addressing data-related challenges and ensuring LLMs can generate accurate and reliable code.
- Comprehensive Literature Review (RQ1): Systematic analysis of 111 papers filtered from 27,000+ studies (2020-2024)
- LLM & Evaluation Analysis (RQ1): Identification of models, metrics, and benchmarks used across LRPL/DSL domains
- Enhancement Methodology Taxonomy (RQ2): Six categories of techniques with effectiveness analysis
- Dataset Curation Analysis (RQ3): Comprehensive review of data collection, processing, and utilization strategies
- Performance Gap Analysis: Quantitative comparison showing significant disparities between HRPLs and LRPLs/DSLs
- Research Roadmap: Identification of challenges and opportunities for future research directions
| Metric | Count |
|---|---|
| Total Papers Reviewed | 111 |
| Initial Paper Pool | 27,000+ |
| Time Period | 2020-2024 |
| Programming Languages | 40+ |
| Research Venues | 39 |
| LRPL Papers | 51 |
| DSL Papers | 59 |
| Both LRPL & DSL | 1 |
| Technique | Paper Count |
|---|---|
| Fine-tuning | 48 |
| Prompting Strategies | 25 |
| Pre-training | 22 |
| Iterative Feedback | 11 |
| RAG | 6 |
| Decoding | 5 |
| DSL Creation | 4 |
| Novel Architectures | 3 |
| Knowledge Distillation | 2 |
| Model Family | Usage Count |
|---|---|
| LLaMA Family | 14 |
| DeepSeek Family | 10 |
| StarCoder Family | 9 |
| CodeGen | 6 |
| CodeQwen | 4 |
| T5 | 3 |
| Mistral | 3 |
| CodeT5 | 3 |
| CodeGPT | 2 |
| GPT-2 | 2 |
| CodeGeeX | 2 |
| Others | 18 |
| Paper | Benchmark | Lang. | Model | Metric | Base | Finetuned |
|---|---|---|---|---|---|---|
| [128] | HumanEval | Rust | StarCoder | pass@1 | 21.8 | 23.4 |
| [85] | Regex-turk | Regex | T5 | acc. | 58.0 | 64.2 |
| [85] | FOL-mnli | FOL | T5 | acc. | 46.9 | 53.9 |
| [85] | FOL-codesc | FOL | T5 | acc. | 58.6 | 59.0 |
| [85] | LTL-synthesis | LTL | T5 | acc. | 87.5 | 87.9 |
| [187] | MultiPL-E | Rust | CodeLLaMA-PY | pass@1 | 27.0 | 40.3 |
| [176] | HumanEval-Haskell | Haskell | CodeGPT | ExMatch | 23.2 | 40.0 |
| [19] | Thakur-et-al. | Verilog | LLaMA2 | pass@5 | 41.2 | 70.6 |
| [15] | McEval | Rust | CodeQwen-1.5 | pass@1 | 47.2 | 67.9 |
| [64] | HumanEval-Kotlin | Kotlin | CodeLLaMA | pass@1 | 26.1 | 42.2 |
| [64] | HumanEval-Kotlin | Kotlin | DeepSeek | pass@1 | 41.0 | 55.3 |
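The effect of fine-tuning in the table above is easier to compare as a gain over the base model. The following is a minimal sketch that computes absolute and relative improvements from the reported numbers; the rows are copied from the table, while the function name `gains` is only illustrative. Because the rows come from different papers, benchmarks, and metrics, the resulting ranking is indicative only and not a like-for-like comparison.

```python
# Rows copied from the fine-tuning results table:
# (paper, benchmark, language, model, metric, base score, fine-tuned score)
ROWS = [
    ("[128]", "HumanEval", "Rust", "StarCoder", "pass@1", 21.8, 23.4),
    ("[85]", "Regex-turk", "Regex", "T5", "acc.", 58.0, 64.2),
    ("[85]", "FOL-mnli", "FOL", "T5", "acc.", 46.9, 53.9),
    ("[85]", "FOL-codesc", "FOL", "T5", "acc.", 58.6, 59.0),
    ("[85]", "LTL-synthesis", "LTL", "T5", "acc.", 87.5, 87.9),
    ("[187]", "MultiPL-E", "Rust", "CodeLLaMA-PY", "pass@1", 27.0, 40.3),
    ("[176]", "HumanEval-Haskell", "Haskell", "CodeGPT", "ExMatch", 23.2, 40.0),
    ("[19]", "Thakur-et-al.", "Verilog", "LLaMA2", "pass@5", 41.2, 70.6),
    ("[15]", "McEval", "Rust", "CodeQwen-1.5", "pass@1", 47.2, 67.9),
    ("[64]", "HumanEval-Kotlin", "Kotlin", "CodeLLaMA", "pass@1", 26.1, 42.2),
    ("[64]", "HumanEval-Kotlin", "Kotlin", "DeepSeek", "pass@1", 41.0, 55.3),
]


def gains(rows):
    """Return (paper, benchmark, language, model, metric, absolute, relative %) per row."""
    result = []
    for paper, bench, lang, model, metric, base, tuned in rows:
        absolute = round(tuned - base, 1)                   # percentage points
        relative = round(100.0 * (tuned - base) / base, 1)  # percent of the base score
        result.append((paper, bench, lang, model, metric, absolute, relative))
    return result


if __name__ == "__main__":
    # Largest absolute gain first.
    for paper, bench, lang, model, metric, abs_gain, rel_gain in sorted(
        gains(ROWS), key=lambda r: r[5], reverse=True
    ):
        print(f"{paper:>6} {lang:<8} {model:<13} {metric:<8} "
              f"+{abs_gain:.1f} pts ({rel_gain:.1f}% relative)")
```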
| Metrics | Languages | Papers |
|---|---|---|
| pass@k | Awk, Bash, Codon, CoffeeScript, Crystal, D, Dart, Elixir, Erlang, Fortran, Go, Groovy, Haskell, Julia, Kotlin, Lean, Lua, Nim, OCaml, Pascal, Perl, PHP, PowerShell, R, Racket, Ruby, Rust, Scala, Scheme, Swift, Tcl, Verilog, VHDL, Vim script, F#, Terraform | [6, 13–15, 19, 25, 30, 41, 47, 62, 64, 78, 80, 102, 113, 114, 128, 130, 131, 137, 140, 142, 143, 149, 170, 177, 187, 192, 193, 208, 210, 215] [103, 205] |
| BLEU | Ansible, Assembly, Bash, Codon, Crystal, CQL, D, Fortran, GitHub Actions YAML, Haskell, Kotlin, LLVM IR, Nim, PowerShell, Ruby, Rust, Scala, Swift, Verilog | [26, 53, 92, 107, 109, 140, 148, 162, 195, 199, 207, 215] |
| ROUGE | Assembly, Codon, Crystal, D, Fortran, Haskell, Julia, Kotlin, Lua, Nim, PowerShell, R, Ruby, Rust, Scala, Swift | [13, 92, 109, 140, 195] |
| Edit Similarity | Kotlin, Rust, Scala, Ruby, Haskell, Bash | [92, 109, 162, 176] |
| Exact Match | Kotlin, Rust, Scala, Ruby, Regex, FOL, LTL, Ansible, Assembly, LLVM IR, CAD, Haskell, R, OCL, CQL, Bash, YAML | [2, 26, 53, 85, 92, 125, 139, 146, 148, 174, 176, 195, 215] |
| METEOR | Kotlin, Rust, Scala, Ruby, Assembly, PowerShell | [92, 109, 195] |
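pass@k is the most widely used metric in the table above. The table does not say how each surveyed paper computes it, so the sketch below assumes the commonly used unbiased estimator: generate `n` samples per problem, count the `c` samples that pass all tests, and estimate the probability that at least one of `k` randomly drawn samples passes. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples that pass all tests,
    k: budget of samples the metric allows (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `pass_at_k(10, 3, 1)` reduces to the plain pass rate `3 / 10`, which is why pass@1 is often read as "fraction of samples that pass".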
| Metric | Language(s) | Papers |
|---|---|---|
| Ansible Aware Metric | Ansible | [146, 148] |
| Schema Correct Metric | Ansible | [146, 148] |
| Command Accuracy (CMD Acc) | Bash | [146, 215] |
| Accuracy per Sequence | Multiple | [85] |
| Semantic Accuracy | Regex | [85, 115] |
| Pass@(scenario) | Verilog | [170] |
| Syn-VCS, Syn-DC | Verilog | [114] |
| Power-Performance-Area (PPA) | Verilog | [50, 117, 122, 172] |
| Execution Accuracy | Bash, CQL | [53, 179] |
| Logical Equivalence (LE) | FOL | [199] |
| Verify@k | F* | [16] |
| Benchmark Name | Languages | Papers | Research Utilization |
|---|---|---|---|
| MultiPL-E | Bash, Lua, Perl, R, Ruby, Racket, D, Go, Julia, Rust, Scala, Swift | [13, 14, 57, 137, 140, 187, 188, 190] [205] | Cross-language comparison; knowledge transfer studies |
| xCodeEval | Kotlin, Ruby, Rust | [102] | LRPL-focused evaluation; executable code testing |
| BabelCode | Dart, Lua, Rust, C#, R, Julia, and Haskell | [137] | Language diversity studies; syntax transfer analysis |
| VerilogEval | Verilog | [25, 78, 80, 113, 114, 130, 142, 149, 172, 177, 208, 210] | Standard Verilog benchmark; hardware LLM evaluation |
| RTLLM | Verilog | [19, 25, 78, 114, 117, 172, 210] | Hardware design; RTL generation |
| TLDR | Bash | [146, 215] | Command generation; system administration |
| FIMO | Lean | [41, 191] | Theorem proving; formal math |
| FOLIO | FOL | [136, 199] | Natural language to logic; reasoning tasks |
| Approach | Description | Example Papers |
|---|---|---|
| Existing Datasets | Direct use of pre-existing datasets | [176], [136], [165] |
| Modified Existing Datasets | Adaptation of existing datasets for specific goals | [146], [195], [196] |
| Collected Datasets | Gathering data from various sources | [57], [64], [190] |
| Source Type | Examples | Languages | Papers |
|---|---|---|---|
| Code Repositories | GitHub, GitLab | Most LRPLs/DSLs | [4, 6, 76, 147, 192] |
| Educational Resources | Textbooks, HDLBits, University courses | R, Verilog, Lean | [125], [170], [36], [191] |
| Programming Contests | Codeforces, contest platforms | Multiple | [102], [140] |
| Specialized Sources | Stack Overflow, Technical forums | Multiple | [28], [57] |
| Generator Model | Target Languages | Papers | Approach |
|---|---|---|---|
| GPT-3.5/4 | Verilog, Kotlin, FOL | [113], [64], [199] | Problem-solution generation |
| Claude3-Haiku | Verilog | [25] | Code-to-code augmentation |
| StarCoder-15B | Multiple LRPLs | [13] | Cross-language translation |
| DeepSeek | Lean | [191] | Quality classification |
- Initial Filtering: File extensions, size limits, license compatibility
- Deduplication: MinHash, ROUGE-L based approaches
- Fine-grained Filtering: Syntax validation, domain-specific rules
- Code Extraction and Cleaning: Comment separation, structure preservation
- Quality Checks: Compilation tests, static analysis
- Dataset-specific Processing: Domain adaptations
- Decontamination: Benchmark data removal
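Three of the processing steps listed above (initial filtering, MinHash deduplication, and decontamination) can be sketched in a few lines. This is a minimal illustration using only the standard library, not the pipeline of any surveyed paper: the shingle size, the number of hashes, the 0.8 similarity threshold, and all function names are assumptions chosen for the example, and the ROUGE-L variant of deduplication is not shown.

```python
import hashlib
import re


def shingles(code: str, n: int = 5) -> set[str]:
    """Set of overlapping token n-grams; whitespace differences are ignored."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    if len(tokens) <= n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash(items: set[str], num_hashes: int = 64) -> list[int]:
    """MinHash signature: for each salted hash, keep the minimum over the set."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def initial_filter(files: dict[str, str], extensions: tuple[str, ...],
                   max_bytes: int) -> dict[str, str]:
    """Keep files with an accepted extension that are not larger than max_bytes."""
    return {path: code for path, code in files.items()
            if path.endswith(extensions) and len(code.encode("utf-8")) <= max_bytes}


def deduplicate(files: dict[str, str], threshold: float = 0.8) -> dict[str, str]:
    """Drop a file if it is a near-duplicate of one already kept (quadratic, for clarity)."""
    kept, signatures = {}, []
    for path, code in files.items():
        sig = minhash(shingles(code))
        if all(similarity(sig, seen) < threshold for seen in signatures):
            kept[path] = code
            signatures.append(sig)
    return kept


def decontaminate(files: dict[str, str], benchmark_solutions: list[str],
                  n: int = 10) -> dict[str, str]:
    """Remove files sharing any token n-gram with a benchmark solution."""
    benchmark_grams: set[str] = set()
    for solution in benchmark_solutions:
        benchmark_grams |= shingles(solution, n)
    return {path: code for path, code in files.items()
            if not (shingles(code, n) & benchmark_grams)}
```

A real pipeline would typically replace the pairwise comparison in `deduplicate` with locality-sensitive hashing so that it scales beyond a few thousand files.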
Rust:
- [128] OctoPack: Instruction tuning code large language models - Muennighoff et al., 2023 - Paper
- [30] Assessing Code Generation with Intermediate Languages - Deng et al., 2024 - Paper
- [47] ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation - Ren et al., 2024 - Paper
- [187] MagiCoder: Empowering Code Generation with OSS-INSTRUCT - Wei et al., 2024 - Paper
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
- [176] Investigating the performance of language models for completing code in functional programming languages: a haskell case study - Van Dam et al., 2024 - Paper
- [15] McEval: Massively Multilingual Code Evaluation - Chai et al., 2025 - Paper
- [64] Kotlin ML Pack: Technical Report - Titov et al., 2024 - Paper
D:
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
- [188] Batched Low-Rank Adaptation of Foundation Models - Wen & Chaudhuri, 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
Nim:
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
Crystal:
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
Swift:
- [6] Multi-lingual evaluation of code generation models - Athiwaratkun et al., 2023 - Paper
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
- [143] HumanEval-XL: A Multilingual Code Generation Benchmark - Peng et al., 2024 - Paper
Dart:
- [137] Measuring the impact of programming language distribution - Orlanski et al., 2023 - Paper
Kotlin:
- [6] Multi-lingual evaluation of code generation models - Athiwaratkun et al., 2023 - Paper
- [64] Kotlin ML Pack: Technical Report - Titov et al., 2024 - Paper
- [92] Language models for code completion: A practical evaluation - Izadi et al., 2024 - Paper
- [102] XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark - Khan et al., 2024 - Paper
- [143] HumanEval-XL: A Multilingual Code Generation Benchmark - Peng et al., 2024 - Paper
Ruby:
- [6] Multi-lingual evaluation of code generation models - Athiwaratkun et al., 2023 - Paper
- [82] MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion - Gong et al., 2022 - Paper
- [92] Language models for code completion: A practical evaluation - Izadi et al., 2024 - Paper
- [102] XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark - Khan et al., 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
Perl:
Lua:
- [13] Knowledge transfer from high-resource to low-resource programming languages for code llms - Cassano et al., 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
- [137] Measuring the impact of programming language distribution - Orlanski et al., 2023 - Paper
PowerShell:
Haskell:
- [137] Measuring the impact of programming language distribution - Orlanski et al., 2023 - Paper
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
- [176] Investigating the performance of language models for completing code in functional programming languages: a haskell case study - Van Dam et al., 2024 - Paper
Scala:
- [92] Language models for code completion: A practical evaluation - Izadi et al., 2024 - Paper
- [102] XCodeEval: An Execution-based Large Scale Multilingual Multitask Benchmark - Khan et al., 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
- [143] HumanEval-XL: A Multilingual Code Generation Benchmark - Peng et al., 2024 - Paper
Racket:
OCaml:
- [13] Knowledge transfer from high-resource to low-resource programming languages for code llms - Cassano et al., 2024 - Paper
R:
- [13] Knowledge transfer from high-resource to low-resource programming languages for code llms - Cassano et al., 2024 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
- [33] GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding - Diera et al., 2023 - Paper
- [125] User Centric Evaluation of Code Generation Tools - Miah & Zhu, 2024 - Paper
- [147] Time-Efficient Code Completion Model for the R Programming Language - Popov et al., 2021 - Paper
- [118] Python is Not Always the Best Choice: Embracing Multilingual Program of Thoughts - Luo et al., 2024 - Paper
Julia:
- [12] A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages - Buscemi, 2023 - Paper
- [79] Evaluation of openai codex for hpc parallel programming models kernel generation - Godoy et al., 2023 - Paper
- [137] Measuring the impact of programming language distribution - Orlanski et al., 2023 - Paper
- [14] MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation - Cassano et al., 2023 - Paper
Fortran:
- [66] Scope is all you need: Transforming LLMs for HPC Code - Kadosh et al., 2023 - Paper
- [79] Evaluation of openai codex for hpc parallel programming models kernel generation - Godoy et al., 2023 - Paper
- [140] IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators - Paul et al., 2024 - Paper
- [159] S3LLM: Large-Scale Scientific Software Understanding with LLMs - Shaik et al., 2024 - Paper
Bash:
- [146] DocCGen: Document-based Controlled Code Generation - Pimparkhede et al., 2024 - Paper
- [162] ShellGPT: Generative Pre-trained Transformer Model for Shell Language Understanding - Shi et al., 2023 - Paper
- [179] Tackling Execution-Based Evaluation for NL2Bash - Vo et al., 2024 - Paper
- [196] InterCode: standardizing and benchmarking interactive coding with execution feedback - Yang et al., 2023 - Paper
- [215] DocPrompting: Generating Code by Retrieving the Docs - Zhou et al., 2023 - Paper
Verilog:
- [19] Data is all you need: Finetuning llms for chip design via an automated design-data augmentation framework - Chang et al., 2024 - Paper
- [25] Origen: Enhancing rtl code generation with code-to-code augmentation and self-reflection - Cui et al., 2024 - Paper
- [78] Autovcoder: A systematic framework for automated verilog code generation using llms - Gao et al., 2024 - Paper
- [80] From English to ASIC: Hardware Implementation with Large Language Model - Goh et al., 2024 - Paper
- [113] Verilogeval: Evaluating large language models for verilog code generation - Liu et al., 2023 - Paper
- [114] RTLCoder: Outperforming GPT-3.5 in Design RTL Generation - Liu et al., 2024 - Paper
- [117] Rtllm: An open-source benchmark for design rtl generation with large language model - Lu et al., 2024 - Paper
- [130] A multi-expert large language model architecture for verilog code generation - Nadimi & Zheng, 2024 - Paper
- [142] BetterV: controlled verilog generation with discriminative guidance - Pei et al., 2024 - Paper
- [149] AutoBench: Automatic Testbench Generation and Evaluation Using LLMs for HDL Design - Qiu et al., 2024 - Paper
- [170] Benchmarking large language models for automated verilog rtl code generation - Thakur et al., 2023 - Paper
- [172] Advanced Large Language Model (LLM)-Driven Verilog Development - Thorat et al., 2023 - Paper
- [177] VHDLEval: A Framework for Evaluating Large Language Models in VHDL Code Generation - Vijayaraghavan et al., 2024 - Paper
- [208] MG-Verilog: Multi-grained Dataset Towards Enhanced LLM-assisted Verilog Generation - Zhang et al., 2024 - Paper
- [210] CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization - Zhao et al., 2024 - Paper
VHDL:
- [177] VHDLEval: A Framework for Evaluating Large Language Models in VHDL Code Generation - Vijayaraghavan et al., 2024 - Paper
System Verilog:
- [75] Hardware phi-1.5 b: A large language model encodes hardware domain specific knowledge - Fu et al., 2024 - Paper
- [99] (Security) Assertions by Large Language Models - Kande et al., 2024 - Paper
- [121] Chiraag: Chatgpt informed rapid and automated assertion generation - Mali et al., 2024 - Paper
- [194] AssertLLM: Generating Hardware Verification Assertions from Design Specifications via Multi-LLMs - Yan et al., 2025 - Paper
Ansible:
- [132] KubePlaybook: A Repository of Ansible Playbooks for Kubernetes Auto-Remediation with LLMs - Namrud et al., 2024 - Paper
- [146] DocCGen: Document-based Controlled Code Generation - Pimparkhede et al., 2024 - Paper
- [148] Automated Code Generation for Information Technology Tasks in YAML through Large Language Models - Pujar et al., 2025 - Paper
- [154] Ansible lightspeed: A code generation service for it automation - Sahoo et al., 2024 - Paper
YAML:
- [207] On the effectiveness of large language models for github workflows - Zhang et al., 2024 - Paper
Terraform HCL:
- [103] Iac-eval: A code generation benchmark for cloud infrastructure-as-code programs - Kon et al., 2024 - Paper
Lean:
- [36] Towards a Mathematics Formalisation Assistant using Large Language Models - Agrawal et al., 2022 - Paper
- [41] FIMO: A Challenge Formal Dataset for Automated Theorem Proving - Liu et al., 2023 - Paper
- [191] Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data - Xin et al., 2024 - Paper
Coq:
- [74] Enhancing Formal Theorem Proving: A Comprehensive Dataset for Training AI Models on Coq Code - Florath, 2024 - Paper
F*:
- [16] Towards Neural Synthesis for SMT-Assisted Proof-Oriented Programming - Chakraborty et al., 2025 - Paper
Verus:
- [200] Leveraging Large Language Models for Automated Proof Synthesis in Rust - Yao et al., 2023 - Paper
UCLID5:
- [126] Synthetic programming elicitation for text-to-code in very low-resource programming and formal languages - Mora et al., 2024 - Paper
FOL:
- [85] Formal Specifications from Natural Language - Hahn et al., 2022 - Paper
- [136] LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers - Olausson et al., 2023 - Paper
- [199] Harnessing the Power of Large Language Models for Natural Language to First-Order Logic Translation - Yang et al., 2024 - Paper
LTL:
Regex:
Excel Formulas:
- [97] Flame: A small language model for spreadsheet formulas - Joshi et al., 2024 - Paper
CAD Sketches:
- [139] Sketchgen: Generating constrained cad sketches - Para et al., 2021 - Paper
XDL (Chemistry):
- [165] Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting - Skreta et al., 2023 - Paper
PDDL:
SMILES:
- [180] Grammar prompting for domain-specific language generation with large language models - Wang et al., 2023 - Paper
Sathvik Joel
- Email: ksjoe30@gmail.com
Jie JW Wu (Repo Maintainer)
- Email: jie.jw.wu@ubc.ca
Fatemeh Fard
- Email: fatemeh.fard@ubc.ca
If you find this work useful, please cite the following reference:
@article{joel2024survey,
title = {A Survey on LLM-based Code Generation for Low-Resource and Domain-Specific Programming Languages},
author = {Joel, Sathvik and Wu, Jie JW and Fard, Fatemeh H.},
journal = {arXiv preprint arXiv:2410.03981},
year = {2024},
url = {https://arxiv.org/abs/2410.03981v3}
}