README.md: 43 additions & 7 deletions
@@ -19,17 +19,19 @@ This is the repo for our TMLR [code LLM survey](https://arxiv.org/abs/2311.07989

 ## News

-🔥🔥🔥 [2025/10/30] Featured papers:
+🔥🔥🔥 [2025/11/10] Featured papers:

-- 🔥🔥 [VisCoder2: Building Multi-Language Visualization Coding Agents](https://arxiv.org/abs/2510.23642) from University of Waterloo.
+- 🔥🔥 [SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models](https://arxiv.org/abs/2511.05459) from Kuaishou Technology.

-- 🔥🔥 [JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence](https://arxiv.org/abs/2510.23538) from The University of Hong Kong.
+- 🔥🔥 [CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization](https://arxiv.org/abs/2511.01884) from University of Minnesota.

-- 🔥🔥 [From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph](https://arxiv.org/abs/2510.19873) from Chinese Academy of Sciences.
+- 🔥🔥 [CodeClash: Benchmarking Goal-Oriented Software Engineering](https://arxiv.org/abs/2511.00839) from Stanford University.

-- 🔥 [Every Step Evolves: Scaling Reinforcement Learning for Trillion-Scale Thinking Model](https://arxiv.org/abs/2510.18855) from Ant Group.
+- 🔥 [VisCoder2: Building Multi-Language Visualization Coding Agents](https://arxiv.org/abs/2510.23642) from University of Waterloo.

-- 🔥 [TritonRL: Training LLMs to Think and Code Triton Without Cheating](https://arxiv.org/abs/2510.17891) from Carnegie Mellon University.
+- 🔥 [JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence](https://arxiv.org/abs/2510.23538) from The University of Hong Kong.
+
+- 🔥 [From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph](https://arxiv.org/abs/2510.19873) from Chinese Academy of Sciences.

 🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!

@@ -1133,7 +1135,9 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

 84. **SwiftSolve**: "SwiftSolve: A Self-Iterative, Complexity-Aware Multi-Agent Framework for Competitive Programming" [2025-10][[paper](https://arxiv.org/abs/2510.22626)]

-85. **ReVeal**: "ReVeal: Self-Evolving Code Agents via Reliable Self-Verification" [2025-10][[paper](https://arxiv.org/abs/2506.11442)]
+86. "A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks" [2025-10][[paper](https://arxiv.org/abs/2511.00872)]

 ### 3.4 Interactive Coding

@@ -1561,6 +1565,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

 - [**CUDA**] "From Large to Small: Transferring CUDA Optimization Expertise via Reasoning Graph" [2025-10][[paper](https://arxiv.org/abs/2510.19873)]

+- [**CUDA**] "CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization" [2025-10][[paper](https://arxiv.org/abs/2511.01884)]
+
 ## 5. Methods/Models for Downstream Tasks

 For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
@@ -1723,6 +1729,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models" [2025-10][[paper](https://arxiv.org/abs/2510.14232)]

+- "Gistify! Codebase-Level Understanding via Runtime Execution" [2025-10][[paper](https://arxiv.org/abs/2510.26790)]
+
+- "DPO-F+: Aligning Code Repair Feedback with Developers' Preferences" [2025-11][[paper](https://arxiv.org/abs/2511.01043)]
+
 ### Code Similarity and Embedding (Clone Detection, Code Search)

 - "Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations" [2020-09][SIGIR 2021][[paper](https://arxiv.org/abs/2009.02731)]
@@ -2795,6 +2805,12 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "MTIR-SQL: Multi-turn Tool-Integrated Reasoning Reinforcement Learning for Text-to-SQL" [2025-10][[paper](https://arxiv.org/abs/2510.25510)]

+- "SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification" [2025-10][[paper](https://arxiv.org/abs/2510.26840)]
+
+- "SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps" [2025-10][[paper](https://arxiv.org/abs/2510.27532)]
+
+- "MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL" [2025-11][[paper](https://arxiv.org/abs/2511.01008)]
+
 ### Program Proof

 - "Baldur: Whole-Proof Generation and Repair with Large Language Models" [2023-03][FSE 2023][[paper](https://arxiv.org/abs/2303.04910)]
@@ -3427,6 +3443,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "ECO: Enhanced Code Optimization via Performance-Aware Prompting for Code-LLMs" [2025-10][[paper](https://arxiv.org/abs/2510.10517)]

+- "QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code" [2025-11][[paper](https://arxiv.org/abs/2511.01183)]
+
 ### Binary Analysis and Decompilation

 - "Using recurrent neural networks for decompilation" [2018-03][SANER 2018][[paper](https://ieeexplore.ieee.org/document/8330222)]
@@ -3573,6 +3591,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach" [2025-09][[paper](https://arxiv.org/abs/2509.21170)]

+- "SecureReviewer: Enhancing Large Language Models for Secure Code Review through Secure-aware Fine-tuning" [2025-10][[paper](https://arxiv.org/abs/2510.26457)]
+
+- "Issue-Oriented Agent-Based Framework for Automated Review Comment Generation" [2025-11][[paper](https://arxiv.org/abs/2511.00517)]
+
 ### Log Analysis

 - "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08][[paper](https://arxiv.org/abs/2208.10282)]
@@ -3689,6 +3711,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction" [2025-10][[paper](https://arxiv.org/abs/2510.09049)]

+- "Empowering RepoQA-Agent based on Reinforcement Learning Driven by Monte-carlo Tree Search" [2025-10][[paper](https://arxiv.org/abs/2510.26287)]
+
 ### Software Modeling

 - "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12][[paper](https://arxiv.org/abs/2212.03404)]
@@ -3949,6 +3973,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study" [2025-03][[paper](https://arxiv.org/abs/2503.15223)]

+- "Understanding Code Agent Behaviour: An Empirical Study of Success and Failure Trajectories" [2025-10][[paper](https://arxiv.org/abs/2511.00197)]
+
 ### Hallucination

 - "Exploring and Evaluating Hallucinations in LLM-Powered Code Generation" [2024-04][[paper](https://arxiv.org/abs/2404.00971)]
@@ -3973,6 +3999,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "Hallucinations in Code Change to Natural Language Generation: Prevalence and Evaluation of Detection Metrics" [2025-08][[paper](https://arxiv.org/abs/2508.08661)]

+- "A Systematic Literature Review of Code Hallucinations in LLMs: Characterization, Mitigation Methods, Challenges, and Future Directions for Reliable AI" [2025-11][[paper](https://arxiv.org/abs/2511.00776)]
+
 ### Efficiency

 - "EffiBench: Benchmarking the Efficiency of Automatically Generated Code" [2024-02][NeurIPS 2024][[paper](https://arxiv.org/abs/2402.02037)]
@@ -4243,6 +4271,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - "Style2Code: A Style-Controllable Code Generation Framework with Dual-Modal Contrastive Representation Learning" [2025-05][[paper](https://arxiv.org/abs/2505.19442)]

 - "Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models" [2022-04][CHI EA 2022][[paper](https://dl.acm.org/doi/abs/10.1145/3491101.3519665)]
@@ -4449,6 +4479,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

 - **TREAT**: "TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework" [2025-10][[paper](https://arxiv.org/abs/2510.17163)]

+- **SWE-Compass**: "SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models" [2025-11][[paper](https://arxiv.org/abs/2511.05459)]
+
 #### Evaluation Metrics

 - "CodeBLEU: a Method for Automatic Evaluation of Code Synthesis" [2020-09][[paper](https://arxiv.org/abs/2009.10297)]
@@ -4588,6 +4620,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 | 2025-10 | arXiv | LiveOIBench | 403 || "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)][[data](https://liveoibench.github.io/)]|
 | 2025-10 | arXiv | AutoCode | - | - | "AutoCode: LLMs as Problem Setters for Competitive Programming" [[paper](https://arxiv.org/abs/2510.12803)]|
 | 2025-10 | arXiv | UniCode | 492 | - | "UniCode: A Framework for Generating High Quality Competitive Coding Problems" [[paper](https://arxiv.org/abs/2510.17868)]|