Commit 53b3334

latest papers 11-10 (#226)

1 parent 59fb4a5 commit 53b3334

File tree

1 file changed: README.md (43 additions, 7 deletions)
@@ -19,17 +19,19 @@ This is the repo for our TMLR [code LLM survey](https://arxiv.org/abs/2311.07989

## News

-🔥🔥🔥 [2025/10/30] Featured papers:
+🔥🔥🔥 [2025/11/10] Featured papers:

-- 🔥🔥 [VisCoder2: Building Multi-Language Visualization Coding Agents](https://arxiv.org/abs/2510.23642) from University of Waterloo.
+- 🔥🔥 [SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models](https://arxiv.org/abs/2511.05459) from Kuaishou Technology.

-- 🔥🔥 [JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence](https://arxiv.org/abs/2510.23538) from The University of Hong Kong.
+- 🔥🔥 [CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization](https://arxiv.org/abs/2511.01884) from University of Minnesota.

-- 🔥🔥 [From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph](https://arxiv.org/abs/2510.19873) from Chinese Academy of Sciences.
+- 🔥🔥 [CodeClash: Benchmarking Goal-Oriented Software Engineering](https://arxiv.org/abs/2511.00839) from Stanford University.

-- 🔥 [Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model](https://arxiv.org/abs/2510.18855) from Ant Group.
+- 🔥 [VisCoder2: Building Multi-Language Visualization Coding Agents](https://arxiv.org/abs/2510.23642) from University of Waterloo.

-- 🔥 [TritonRL: Training LLMs to Think and Code Triton Without Cheating](https://arxiv.org/abs/2510.17891) from Carnegie Mellon University.
+- 🔥 [JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence](https://arxiv.org/abs/2510.23538) from The University of Hong Kong.
+
+- 🔥 [From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph](https://arxiv.org/abs/2510.19873) from Chinese Academy of Sciences.

🔥🔥     [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!

@@ -1133,7 +1135,9 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

84. **SwiftSolve**: "SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming" [2025-10] [[paper](https://arxiv.org/abs/2510.22626)]

-85. **ReVeal**: "ReVeal: Self-Evolving Code Agents via Reliable Self-Verification" [2025-10] [[paper](https://arxiv.org/abs/2506.11442)]
+85. **CodeClash**: "CodeClash: Benchmarking Goal-Oriented Software Engineering" [2025-11] [[paper](https://arxiv.org/abs/2511.00839)]
+
+86. "A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks" [2025-10] [[paper](https://arxiv.org/abs/2511.00872)]

### 3.4 Interactive Coding

@@ -1561,6 +1565,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- [**CUDA**] "From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph" [2025-10] [[paper](https://arxiv.org/abs/2510.19873)]

+- [**CUDA**] "CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization" [2025-10] [[paper](https://arxiv.org/abs/2511.01884)]
+
## 5. Methods/Models for Downstream Tasks

For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
@@ -1723,6 +1729,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models" [2025-10] [[paper](https://arxiv.org/abs/2510.14232)]

+- "Gistify! Codebase-Level Understanding via Runtime Execution" [2025-10] [[paper](https://arxiv.org/abs/2510.26790)]
+
### Code RAG

- "CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation" [2024-05] [[paper](https://arxiv.org/abs/2405.02355)]
@@ -2115,6 +2123,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "InspectCoder: Dynamic Analysis-Enabled Self Repair through interactive LLM-Debugger Collaboration" [2025-10] [[paper](https://arxiv.org/abs/2510.18327)]

+- "DPO-F+: Aligning Code Repair Feedback with Developers' Preferences" [2025-11] [[paper](https://arxiv.org/abs/2511.01043)]
+
### Code Similarity and Embedding (Clone Detection, Code Search)

- "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations" [2020-09] [SIGIR 2021] [[paper](https://arxiv.org/abs/2009.02731)]
@@ -2795,6 +2805,12 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL" [2025-10] [[paper](https://arxiv.org/abs/2510.25510)]

+- "SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification" [2025-10] [[paper](https://arxiv.org/abs/2510.26840)]
+
+- "SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps" [2025-10] [[paper](https://arxiv.org/abs/2510.27532)]
+
+- "MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL" [2025-11] [[paper](https://arxiv.org/abs/2511.01008)]
+
### Program Proof

- "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03] [FSE 2023] [[paper](https://arxiv.org/abs/2303.04910)]
@@ -3427,6 +3443,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "ECO: Enhanced Code Optimization via Performance-Aware Prompting for Code-LLMs" [2025-10] [[paper](https://arxiv.org/abs/2510.10517)]

+- "QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code" [2025-11] [[paper](https://arxiv.org/abs/2511.01183)]
+
### Binary Analysis and Decompilation

- "Using recurrent neural networks for decompilation" [2018-03] [SANER 2018] [[paper](https://ieeexplore.ieee.org/document/8330222)]
@@ -3573,6 +3591,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach" [2025-09] [[paper](https://arxiv.org/abs/2509.21170)]

+- "SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning" [2025-10] [[paper](https://arxiv.org/abs/2510.26457)]
+
+- "Issue-Oriented Agent-Based Framework for Automated Review Comment Generation" [2025-11] [[paper](https://arxiv.org/abs/2511.00517)]
+
### Log Analysis

- "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08] [[paper](https://arxiv.org/abs/2208.10282)]
@@ -3689,6 +3711,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction" [2025-10] [[paper](https://arxiv.org/abs/2510.09049)]

+- "Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search" [2025-10] [[paper](https://arxiv.org/abs/2510.26287)]
+
### Software Modeling

- "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)]
@@ -3949,6 +3973,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study" [2025-03] [[paper](https://arxiv.org/abs/2503.15223)]

+- "Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories" [2025-10] [[paper](https://arxiv.org/abs/2511.00197)]
+
### Hallucination

- "Exploring and Evaluating Hallucinations in LLM-Powered Code Generation" [2024-04] [[paper](https://arxiv.org/abs/2404.00971)]
@@ -3973,6 +3999,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics" [2025-08] [[paper](https://arxiv.org/abs/2508.08661)]

+- "A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI" [2025-11] [[paper](https://arxiv.org/abs/2511.00776)]
+
### Efficiency

- "EffiBench: Benchmarking the Efficiency of Automatically Generated Code" [2024-02] [NeurIPS 2024] [[paper](https://arxiv.org/abs/2402.02037)]
@@ -4243,6 +4271,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Style2Code: A Style-Controllable Code Generation Framework with Dual-Modal Contrastive Representation Learning" [2025-05] [[paper](https://arxiv.org/abs/2505.19442)]

+- "CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments" [2025-10] [[paper](https://arxiv.org/abs/2510.27565)]
+
## 7. Human-LLM Interaction

- "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models" [2022-04] [CHI EA 2022] [[paper](https://dl.acm.org/doi/abs/10.1145/3491101.3519665)]
@@ -4449,6 +4479,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- **TREAT**: "TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework" [2025-10] [[paper](https://arxiv.org/abs/2510.17163)]

+- **SWE-Compass**: "SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models" [2025-11] [[paper](https://arxiv.org/abs/2511.05459)]
+
#### Evaluation Metrics

- "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" [2020-09] [[paper](https://arxiv.org/abs/2009.10297)]
@@ -4588,6 +4620,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
| 2025-10 | arXiv | LiveOIBench | 403 | - | "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)] [[data](https://liveoibench.github.io/)] |
| 2025-10 | arXiv | AutoCode | - | - | "AutoCode: LLMs as Problem Setters for Competitive Programming" [[paper](https://arxiv.org/abs/2510.12803)] |
| 2025-10 | arXiv | UniCode | 492 | - | "UniCode: A Framework for Generating High Quality Competitive Coding Problems" [[paper](https://arxiv.org/abs/2510.17868)] |
+| 2025-10 | arXiv | RealClassEval | 400 | Python | "Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation" [[paper](https://arxiv.org/abs/2510.26130)] [[data](https://github.com/mrsumitbd/RealClassEval-Replication)] |

\* Automatically mined/human-annotated

@@ -4636,6 +4669,7 @@ $^\diamond$ Machine/human prompts
| 2025-05 | arXiv | CodeSense | 4495 | Python, C, Java | "CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning" [[paper](https://arxiv.org/abs/2506.00750)] [[data](https://codesense-bench.github.io/)] |
| 2025-07 | arXiv | CORE | 12,533 | C/C++, Java, Python | "CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks" [[paper](https://arxiv.org/abs/2507.05269)] [[data](https://corebench.github.io/)] |
| 2025-09 | arXiv | SWE-QA | 576 | Python | "SWE-QA: Can Language Models Answer Repository-level Code Questions?" [[paper](https://arxiv.org/abs/2509.14635)] [[data](https://github.com/peng-weihan/SWE-QA-Bench)] |
+| 2025-11 | arXiv | VCode | 464 | SVG | "VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation" [[paper](https://arxiv.org/abs/2511.02778)] [[data](https://github.com/CSU-JPG/VCode)] |

#### Text-to-SQL

@@ -4896,6 +4930,8 @@ $^\diamond$ Machine/human prompts
| 2025-09 | arXiv | RepoDebug | 30696 | 8 | "RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models" [[paper](https://arxiv.org/abs/2509.04078)] |
| 2025-09 | arXiv | SWE-Bench Pro | 1865 | Python, Go, JS, TS | "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [[paper](https://arxiv.org/abs/2509.16941)] [[data](https://github.com/scaleapi/SWE-bench_Pro-os)] |
| 2025-10 | arXiv | E2EDev | 46 | Python | "E2Edev: Benchmarking Large Language Models in End-to-End Software Development Task" [[paper](https://arxiv.org/abs/2510.14509)] [[data](https://github.com/SCUNLP/E2EDev)] |
+| 2025-11 | arXiv | SWE-Sharp-Bench | 150 | C# | "SWE-Sharp-Bench: A Reproducible Benchmark for C# Software Engineering Tasks" [[paper](https://arxiv.org/abs/2511.02352)] [[data](https://github.com/microsoft/prose/tree/main/misc/SWE-Sharp-Bench)] |
+| 2025-11 | arXiv | CodeProjectEval | 18 | Python | "Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling" [[paper](https://arxiv.org/abs/2511.03404)] [[data](https://github.com/whisperzqh/ProjectGen)] |

\*Line Completion/API Invocation Completion/Function Completion
