Commit 8fc2008

latest papers 10-23 (#218)
1 parent 1398021 commit 8fc2008

File tree

1 file changed: +48 -3 lines changed


README.md

Lines changed: 48 additions & 3 deletions
@@ -4,7 +4,7 @@
 <img src='imgs/wordcloud.png' style='width: 100%; '>
 </p>
 
-This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code](https://arxiv.org/abs/2311.07989) - a comprehensive review of LLM researches for code. Works in each category are ordered chronologically. If you have a basic understanding of machine learning but are new to NLP, we also provide a list of recommended readings in [section 9](#9-recommended-readings). If you refer to this repo, please cite:
+This is the repo for our TMLR [code LLM survey](https://arxiv.org/abs/2311.07989). If you find this repo helpful, please support us by citing:
 
 ```
 @article{zhang2024unifying,
@@ -19,7 +19,11 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per
 
 ## News
 
-🔥🔥🔥 [2025/10/13] Featured papers:
+🔥🔥🔥 [2025/10/23] Featured papers:
+
+- 🔥🔥 [Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model](https://arxiv.org/abs/2510.18855) from Ant Group.
+
+- 🔥🔥 [TritonRL: Training LLMs to Think and Code Triton Without Cheating](https://arxiv.org/abs/2510.17891) from Carnegie Mellon University.
 
 - 🔥 [LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?](https://arxiv.org/abs/2510.09595) from University of Michigan.
 
@@ -403,6 +407,8 @@ These LLMs are not specifically trained for code, but have demonstrated varying
 
 91. **LLaDA-MoE**: "LLaDA-MoE: A Sparse MoE Diffusion Language Model" [2025-09] [[paper](https://arxiv.org/abs/2509.24389)]
 
+92. **Ring-1T**: "Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model" [2025-10] [[paper](https://arxiv.org/abs/2510.18855)]
+
 ### 2.2 Existing LLM Adapted to Code
 
 These models are general-purpose LLMs further pretrained on code-related data.
@@ -777,6 +783,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 
 35. **Critique-Coder**: "Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning" [2025-09] [[paper](https://arxiv.org/abs/2509.22824)]
 
+36. **CodeRL+**: "CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment" [2025-10] [[paper](https://arxiv.org/abs/2510.18471)]
+
 ## 3. When Coding Meets Reasoning
 
 ### 3.1 Coding for Reasoning
@@ -1109,6 +1117,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 
 81. **VeriGuard**: "VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation" [2025-10] [[paper](https://arxiv.org/abs/2510.05156)]
 
+82. **KAT-Coder**: "KAT-Coder Technical Report" [2025-10] [[paper](https://arxiv.org/abs/2510.18779)]
+
 ### 3.4 Interactive Coding
 
 - "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]
@@ -1219,6 +1229,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 
 - "SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement" [2025-09] [[paper](https://arxiv.org/abs/2509.18808)]
 
+- "Benchmarking Correctness and Security in Multi-Turn Code Generation" [2025-10] [[paper](https://arxiv.org/abs/2510.13859)]
+
 ### 3.5 Frontend Navigation
 
 - "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" [2021-10] [ACL 2022] [[paper](https://arxiv.org/abs/2110.08518)]
@@ -1525,6 +1537,12 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities
 
 - [**CUDA**] "EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models" [2025-10] [[paper](https://arxiv.org/abs/2510.03760)]
 
+- [**Verilog**] "Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code" [2025-10] [[paper](https://arxiv.org/abs/2510.14756)]
+
+- [**CUDA**] "Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization" [2025-10] [[paper](https://arxiv.org/abs/2510.17158)]
+
+- [**Triton**] "TritonRL: Training LLMs to Think and Code Triton Without Cheating" [2025-10] [[paper](https://arxiv.org/abs/2510.17891)]
+
 ## 5. Methods/Models for Downstream Tasks
 
 For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
@@ -1685,6 +1703,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "LongCodeZip: Compress Long Context for Code Language Models" [2025-10] [[paper](https://arxiv.org/abs/2510.00446)]
 
+- "Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models" [2025-10] [[paper](https://arxiv.org/abs/2510.14232)]
+
 ### Code RAG
 
 - "CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation" [2024-05] [[paper](https://arxiv.org/abs/2405.02355)]
@@ -1855,6 +1875,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "Function-to-Style Guidance of LLMs for Code Translation" [2025-07] [ICML 2025] [[paper](https://arxiv.org/abs/2507.11083)]
 
+- "EffiReasonTrans: RL-Optimized Reasoning for Code Translation" [2025-10] [[paper](https://arxiv.org/abs/2510.18863)]
+
 ### Code Commenting and Summarization
 
 - "A Transformer-based Approach for Source Code Summarization" [2020-05] [ACL 2020] [[paper](https://arxiv.org/abs/2005.00653)]
@@ -1927,6 +1949,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "DocAgent: A Multi-Agent System for Automated Code Documentation Generation" [2025-04] [[paper](https://arxiv.org/abs/2504.08725)]
 
+- "HGAdapter: Hypergraph-based Adapters in Language Models for Code Summarization and Clone Detection" [2025-10] [[paper](https://arxiv.org/abs/2510.17591)]
+
 ### Program Repair
 
 - "CURE: Code-Aware Neural Machine Translation for Automatic Program Repair" [2021-02] [ICSE 2021] [[paper](https://arxiv.org/abs/2103.00073)]
@@ -2065,7 +2089,9 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "The Impact of Fine-tuning Large Language Models on Automated Program Repair" [2025-07] [ICSME 2025] [[paper](https://arxiv.org/abs/2507.19909)]
 
-- "How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair" [2025-08] [ESEM 2025] [[paper](https://arxiv.org/abs/2508.16499v1)]
+- "How Small is Enough? Empirical Evidence of Quantized Small Language Models for Automated Program Repair" [2025-08] [ESEM 2025] [[paper](https://arxiv.org/abs/2508.16499)]
+
+- "InspectCoder: Dynamic Analysis-Enabled Self Repair through Interactive LLM-Debugger Collaboration" [2025-10] [[paper](https://arxiv.org/abs/2510.18327)]
 
 ### Code Similarity and Embedding (Clone Detection, Code Search)
 
@@ -2277,6 +2303,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.16198)]
 
+- "On Pretraining for Project-Level Code Completion" [2025-10] [[paper](https://arxiv.org/abs/2510.13697)]
+
 ### Issue Resolution
 
 - "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [2023-10] [ICLR 2024] [[paper](https://arxiv.org/abs/2310.06770)]
@@ -2307,6 +2335,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "SWE-Bench-CL: Continual Learning for Coding Agents" [2025-06] [[paper](https://arxiv.org/abs/2507.00014)]
 
+- "SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution" [2025-07] [[paper](https://arxiv.org/abs/2507.23348)]
+
 - "SWE-Exp: Experience-Driven Software Issue Resolution" [2025-07] [[paper](https://arxiv.org/abs/2507.23361)]
 
 - "Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning" [2025-08] [[paper](https://arxiv.org/abs/2508.03501)]
@@ -2389,6 +2419,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code" [2025-06] [ACL 2025 Findings] [[paper](https://arxiv.org/abs/2506.07818)]
 
+- "A11YN: aligning LLMs for accessible web UI code generation" [2025-10] [[paper](https://arxiv.org/abs/2510.13914)]
+
+- "WebDevJudge: Evaluating (M)LLMs as Critiques for Web Development Quality" [2025-10] [[paper](https://arxiv.org/abs/2510.18560)]
+
 ### Automated Machine Learning
 
 - "Large Language Models Synergize with Automated Machine Learning" [2024-05] [[paper](https://arxiv.org/abs/2405.03727)]
@@ -2721,6 +2755,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "Agentar-Scale-SQL: Advancing Text-to-SQL through Orchestrated Test-Time Scaling" [2025-09] [[paper](https://arxiv.org/abs/2509.24403)]
 
+- "MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training" [2025-10] [[paper](https://arxiv.org/abs/2510.12831)]
+
+- "Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL" [2025-10] [[paper](https://arxiv.org/abs/2510.14296)]
+
 ### Program Proof
 
 - "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)]
@@ -3801,6 +3839,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks" [2025-10] [[paper](https://arxiv.org/abs/2510.01359)]
 
+- "When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" [2025-10] [[paper](https://arxiv.org/abs/2510.17862)]
+
 ### Correctness
 
 - "An Empirical Evaluation of GitHub Copilot's Code Suggestions" [2022-05] [MSR 2022] [[paper](https://ieeexplore.ieee.org/document/9796235)]
@@ -4363,6 +4403,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - **LoCoBench**: "LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering" [2025-09] [[paper](https://arxiv.org/abs/2509.09614)]
 
+- **TREAT**: "TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework" [2025-10] [[paper](https://arxiv.org/abs/2510.17163)]
+
 #### Evaluation Metrics
 
 - "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" [2020-09] [[paper](https://arxiv.org/abs/2009.10297)]
@@ -4498,6 +4540,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 | 2025-08 | arXiv | AutoCodeBench | 3,920 | 20 | "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators" [[paper](https://arxiv.org/abs/2508.09101)] [[data](https://autocodebench.github.io/)] |
 | 2025-08 | arXiv | AetherCode | 456 | C++ | "AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions" [[paper](https://arxiv.org/abs/2508.16402)] [[data](https://huggingface.co/datasets/m-a-p/AetherCode)] |
 | 2025-10 | arXiv | LiveOIBench | 403 | | "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)] [[data](https://liveoibench.github.io/)] |
+| 2025-10 | arXiv | AutoCode | - | - | "AutoCode: LLMs as Problem Setters for Competitive Programming" [[paper](https://arxiv.org/abs/2510.12803)] |
+| 2025-10 | arXiv | UniCode | 492 | - | "UniCode: A Framework for Generating High Quality Competitive Coding Problems" [[paper](https://arxiv.org/abs/2510.17868)] |
 
 \* Automatically mined/human-annotated
 
@@ -4804,6 +4848,7 @@ $^\diamond$ Machine/human prompts
 | 2025-07 | arXiv | SWE-Perf | 140 | Python | "SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?" [2025-07] [[paper](https://arxiv.org/abs/2507.12415)] [[data](https://github.com/swe-perf/swe-perf)] |
 | 2025-09 | arXiv | RepoDebug | 30696 | 8 | "RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models" [[paper](https://arxiv.org/abs/2509.04078)] |
 | 2025-09 | arXiv | SWE-Bench Pro | 1865 | Python, Go, JS, TS | "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [[paper](https://arxiv.org/abs/2509.16941)] [[data](https://github.com/scaleapi/SWE-bench_Pro-os)] |
+| 2025-10 | arXiv | E2EDev | 46 | Python | "E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task" [[paper](https://arxiv.org/abs/2510.14509)] [[data](https://github.com/SCUNLP/E2EDev)] |
 
 \*Line Completion/API Invocation Completion/Function Completion
 