Commit 379640d: latest papers 09-26 (#212)
1 parent 72af38a

1 file changed: README.md (+31 -12 lines)
@@ -15,17 +15,13 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per

## News

-🔥🔥🔥 [2025/09/22] Featured papers:
+🔥🔥🔥 [2025/09/26] Featured papers:

- 🔥🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.

-- 🔥🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University.
+- 🔥🔥 [SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?](https://arxiv.org/abs/2509.16941) from Scale AI.

-- 🔥 [LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering](https://arxiv.org/abs/2509.09614) from Salesforce AI Research.
-
-- 🔥 [Astra: A Multi-Agent System for GPU Kernel Performance Optimization](https://arxiv.org/abs/2509.07506) from Stanford University.
-
-- 🔥 [GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion](https://arxiv.org/abs/2509.05980) from Zhejiang University.
+- 🔥 [SWE-QA: Can Language Models Answer Repository-level Code Questions?](https://arxiv.org/abs/2509.14635) from Shanghai Jiao Tong University.

🔥🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!

@@ -35,14 +31,10 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per

🔥🔥🔥 [2025/09/22] News from Codefuse

-- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted to NeurIPS 2025. CGM currently ranks 1st among open-source models on [SWE-Bench leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)]
+- [CGM (Code Graph Model)](https://arxiv.org/abs/2505.16901) is accepted to NeurIPS 2025. CGM currently ranks 1st among open-weight models on [SWE-Bench-Lite leaderboard](https://www.swebench.com/). [[repo](https://github.com/codefuse-ai/CodeFuse-CGM)]

- [GALLa: Graph Aligned Large Language Models](https://arxiv.org/abs/2409.04183) is accepted by ACL 2025 main conference. [[repo](https://github.com/codefuse-ai/GALLa)]

-<p align='center'>
-<img src='imgs/swe-leaderboard.png' style='width: 90%; '>
-</p>
-
#### How to Contribute

If you find a paper to be missing from this repository, misplaced in a category, or lacking a reference to its journal/conference information, please do not hesitate to create an issue.
@@ -693,6 +685,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

67. "SCoGen: Scenario-Centric Graph-Based Synthesis of Real-World Code Problems" [2025-09] [[paper](https://arxiv.org/abs/2509.14281)]

+68. "Verification Limits Code LLM Training" [2025-09] [[paper](https://arxiv.org/abs/2509.20837)]
+
### 2.5 Reinforcement Learning on Code

1. **CompCoder**: "Compilable Neural Code Generation with Compiler Feedback" [2022-03] [ACL 2022] [[paper](https://arxiv.org/abs/2203.05132)]
@@ -761,6 +755,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

33. "Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.12434)]

+34. "DELTA-Code: How Does RL Unlock and Transfer New Programming Algorithms in LLMs?" [2025-09] [[paper](https://arxiv.org/abs/2509.21016)]
+
## 3. When Coding Meets Reasoning

### 3.1 Coding for Reasoning
@@ -1077,6 +1073,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

76. "GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging" [2025-08] [[paper](https://arxiv.org/abs/2508.18993)]

+77. **MapCoder-Lite**: "MapCoder-Lite: Squeezing Multi-Agent Coding into a Single Small LLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17489)]
+
### 3.4 Interactive Coding

- "Interactive Program Synthesis" [2017-03] [[paper](https://arxiv.org/abs/1703.03539)]
@@ -1185,6 +1183,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- "CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance" [2025-07] [[paper](https://arxiv.org/abs/2507.10646)]

+- "SR-Eval: Evaluating LLMs on Code Generation under Stepwise Requirement Refinement" [2025-09] [[paper](https://arxiv.org/abs/2509.18808)]
+
### 3.5 Frontend Navigation

- "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" [2021-10] [ACL 2022] [[paper](https://arxiv.org/abs/2110.08518)]
@@ -1295,6 +1295,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- "UI-Venus Technical Report: Building High-performance UI Agents with RFT" [2025-08] [[paper](https://arxiv.org/abs/2508.10833)]

+- "Mano Report" [2025-09] [[paper](https://arxiv.org/abs/2509.17336)]
+
## 4. Code LLM for Low-Resource, Low-Level, and Domain-Specific Languages

- [**Ruby**] "On the Transferability of Pre-trained Language Models for Low-Resource Programming Languages" [2022-04] [ICPC 2022] [[paper](https://arxiv.org/abs/2204.09653)]
@@ -1483,6 +1485,8 @@ These models apply Instruction Fine-Tuning techniques to enhance the capacities

- [**CUDA**] "Astra: A Multi-Agent System for GPU Kernel Performance Optimization" [2025-09] [[paper](https://arxiv.org/abs/2509.07506)]

+- [**LaTeX**] "Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models" [2025-09] [[paper](https://arxiv.org/abs/2509.17589)]
+
## 5. Methods/Models for Downstream Tasks

For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF, and (occasionally) static program analysis); the second column contains non-Transformer neural methods (e.g. LSTM, CNN, GNN); the third column contains Transformer based methods (e.g. BERT, GPT, T5).
@@ -2225,6 +2229,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "GRACE: Graph-Guided Repository-Aware Code Completion through Hierarchical Code Fusion" [2025-09] [[paper](https://arxiv.org/abs/2509.05980)]

+- "CodeRAG: Finding Relevant and Necessary Knowledge for Retrieval-Augmented Repository-Level Code Completion" [2025-09] [[paper](https://arxiv.org/abs/2509.16112)]
+
+- "RPG: A Repository Planning Graph for Unified and Scalable Codebase Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.16198)]
+
### Issue Resolution

- "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" [2023-10] [ICLR 2024] [[paper](https://arxiv.org/abs/2310.06770)]
@@ -3183,6 +3191,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Code-SPA: Style Preference Alignment to Large Language Models for Effective and Robust Code Debugging" [2025-07] [ACL 2025 Findings] [[paper](https://aclanthology.org/2025.findings-acl.912/)]

+- "LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code" [2025-09] [[paper](https://arxiv.org/abs/2509.17337)]
+
### Malicious Code Detection

- "I-MAD: Interpretable Malware Detector Using Galaxy Transformer", 2019-09, Comput. Secur. 2021, [[paper](https://arxiv.org/abs/1909.06865)]
@@ -3337,6 +3347,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "Evaluating Generated Commit Messages with Large Language Models" [2025-07] [[paper](https://arxiv.org/abs/2507.10906)]

+- "CoRaCMG: Contextual Retrieval-Augmented Framework for Commit Message Generation" [2025-09] [[paper](https://arxiv.org/abs/2509.18337)]
+
### Code Review

- "Using Pre-Trained Models to Boost Code Review Automation" [2022-01] [ICSE 2022] [[paper](https://arxiv.org/abs/2201.06850)]
@@ -3417,6 +3429,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects" [2025-09] [[paper](https://arxiv.org/abs/2509.14856)]

+- "Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach" [2025-09] [[paper](https://arxiv.org/abs/2509.21170)]
+
### Log Analysis

- "LogStamp: Automatic Online Log Parsing Based on Sequence Labelling" [2022-08] [[paper](https://arxiv.org/abs/2208.10282)]
@@ -3707,6 +3721,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code" [2025-08] [[paper](https://arxiv.org/abs/2508.18106)]

+- "Localizing Malicious Outputs from CodeLLM" [2025-09] [[paper](https://arxiv.org/abs/2509.17070)]
+
### Correctness

- "An Empirical Evaluation of GitHub Copilot's Code Suggestions" [2022-05] [MSR 2022] [[paper](https://ieeexplore.ieee.org/document/9796235)]
@@ -4201,6 +4217,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF

- "ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming" [2025-05] [ACL 2025] [[paper](https://arxiv.org/abs/2505.16667)]

+- "Intuition to Evidence: Measuring AI's True Impact on Developer Productivity" [2025-09] [[paper](https://arxiv.org/abs/2509.19708)]
+
## 8. Datasets

### 8.1 Pretraining
@@ -4692,6 +4710,7 @@ $^\diamond$ Machine/human prompts

| 2025-07 | arXiv | LiveRepoReflection | 1888 | C++, Go, Java, JS, Python, Rust | "Turning the Tide: Repository-based Code Reflection" [[paper](https://arxiv.org/abs/2507.09866)] |
| 2025-07 | arXiv | SWE-Perf | 140 | Python | "SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?" [2025-07] [[paper](https://arxiv.org/abs/2507.12415)] [[data](https://github.com/swe-perf/swe-perf)] |
| 2025-09 | arXiv | RepoDebug | 30696 | 8 | "RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models" [[paper](https://arxiv.org/abs/2509.04078)] |
+| 2025-09 | arXiv | SWE-Bench Pro | 1865 | Python, Go, JS, TS | "SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?" [[paper](https://arxiv.org/abs/2509.16941)] [[data](https://github.com/scaleapi/SWE-bench_Pro-os)] |

\*Line Completion/API Invocation Completion/Function Completion