This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code](https://arxiv.org/abs/2311.07989) - a comprehensive review of LLM research for code. Works in each category are ordered chronologically. If you have a basic understanding of machine learning but are new to NLP, we also provide a list of recommended readings in [section 9](#9-recommended-readings). If you refer to this repo, please cite:
```
@article{zhang2024unifying,
  title={Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code},
  author={Zhang, Ziyin and others},
  journal={Transactions on Machine Learning Research},
  year={2024},
  url={https://arxiv.org/abs/2311.07989}
}
```
## News
🔥🔥🔥 [2025/10/23] Featured papers:
- 🔥🔥 [Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model](https://arxiv.org/abs/2510.18855) from Ant Group.
- 🔥🔥 [TritonRL: Training LLMs to Think and Code Triton Without Cheating](https://arxiv.org/abs/2510.17891) from Carnegie Mellon University.
- 🔥 [LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?](https://arxiv.org/abs/2510.09595) from University of Michigan.
91. **LLaDA-MoE**: "LLaDA-MoE: A Sparse MoE Diffusion Language Model" [2025-09][[paper](https://arxiv.org/abs/2509.24389)]
- "Interactive Program Synthesis" [2017-03][[paper](https://arxiv.org/abs/1703.03539)]
- "SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement" [2025-09][[paper](https://arxiv.org/abs/2509.18808)]
- "Benchmarking Correctness and Security in Multi-Turn Code Generation" [2025-10][[paper](https://arxiv.org/abs/2510.13859)]
### 3.5 Frontend Navigation
- "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" [2021-10][ACL 2022][[paper](https://arxiv.org/abs/2110.08518)]
- [**CUDA**] "EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models" [2025-10][[paper](https://arxiv.org/abs/2510.03760)]
- [**Verilog**] "Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code" [2025-10][[paper](https://arxiv.org/abs/2510.14756)]

- [**CUDA**] "Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization" [2025-10][[paper](https://arxiv.org/abs/2510.17158)]

- [**Triton**] "TritonRL: Training LLMs to Think and Code Triton Without Cheating" [2025-10][[paper](https://arxiv.org/abs/2510.17891)]
## 5. Methods/Models for Downstream Tasks
For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
- "LongCodeZip: Compress Long Context for Code Language Models" [2025-10][[paper](https://arxiv.org/abs/2510.00446)]
- "Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models" [2025-10][[paper](https://arxiv.org/abs/2510.14232)]
- "Function-to-Style Guidance of LLMs for Code Translation" [2025-07][ICML 2025][[paper](https://arxiv.org/abs/2507.11083)]
- "EffiReasonTrans: RL-Optimized Reasoning for Code Translation" [2025-10][[paper](https://arxiv.org/abs/2510.18863)]
### Code Commenting and Summarization
- "A Transformer-based Approach for Source Code Summarization" [2020-05][ACL 2020][[paper](https://arxiv.org/abs/2005.00653)]
- "DocAgent: A Multi-Agent System for Automated Code Documentation Generation" [2025-04][[paper](https://arxiv.org/abs/2504.08725)]
- "HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection" [2025-10][[paper](https://arxiv.org/abs/2510.17591)]
### Program Repair
- "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair" [2021-02][ICSE 2021][[paper](https://arxiv.org/abs/2103.00073)]
- "The Impact of Fine-tuning Large Language Models on Automated Program Repair" [2025-07][ICSME 2025][[paper](https://arxiv.org/abs/2507.19909)]
- "How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair" [2025-08][ESEM 2025][[paper](https://arxiv.org/abs/2508.16499)]
- "WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code" [2025-06][ACL 2025 Findings][[paper](https://arxiv.org/abs/2506.07818)]
- "A11YN: aligning LLMs for accessible web UI code generation" [2025-10][[paper](https://arxiv.org/abs/2510.13914)]
- "WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality" [2025-10][[paper](https://arxiv.org/abs/2510.18560)]
### Automated Machine Learning
- "Large Language Models Synergize with Automated Machine Learning" [2024-05][[paper](https://arxiv.org/abs/2405.03727)]
- "Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling" [2025-09][[paper](https://arxiv.org/abs/2509.24403)]
- "MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training" [2025-10][[paper](https://arxiv.org/abs/2510.12831)]
- "Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL" [2025-10][[paper](https://arxiv.org/abs/2510.14296)]
### Program Proof
- "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03][FSE 2023][[paper](https://arxiv.org/abs/2303.04910)]
- "Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks" [2025-10][[paper](https://arxiv.org/abs/2510.01359)]
- "When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" [2025-10][[paper](https://arxiv.org/abs/2510.17862)]
- **LoCoBench**: "LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering" [2025-09][[paper](https://arxiv.org/abs/2509.09614)]

- **TREAT**: "TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework" [2025-10][[paper](https://arxiv.org/abs/2510.17163)]
#### Evaluation Metrics
- "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" [2020-09][[paper](https://arxiv.org/abs/2009.10297)]
| Date | Venue | Benchmark | Size | Language | Source |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 2025-08 | arXiv | AutoCodeBench | 3,920 | 20 | "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators" [[paper](https://arxiv.org/abs/2508.09101)][[data](https://autocodebench.github.io/)]|
| 2025-08 | arXiv | AetherCode | 456 | C++ | "AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions" [[paper](https://arxiv.org/abs/2508.16402)][[data](https://huggingface.co/datasets/m-a-p/AetherCode)]|
| 2025-10 | arXiv | LiveOIBench | 403 | - | "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)][[data](https://liveoibench.github.io/)]|
| 2025-10 | arXiv | AutoCode | - | - | "AutoCode: LLMs as Problem Setters for Competitive Programming" [[paper](https://arxiv.org/abs/2510.12803)]|
| 2025-10 | arXiv | UniCode | 492 | - | "UniCode: A Framework for Generating High Quality Competitive Coding Problems" [[paper](https://arxiv.org/abs/2510.17868)]|