
Commit 21cd972

latest papers 10-13 (#215)
1 parent 3f038f2 commit 21cd972

File tree

1 file changed: +15 −10 lines


README.md

Lines changed: 15 additions & 10 deletions
@@ -15,21 +15,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per
 
 ## News
 
-🔥🔥🔥 [2025/10/11] Featured papers:
+🔥🔥🔥 [2025/10/13] Featured papers:
 
-- 🔥🔥 [EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models](https://arxiv.org/abs/2510.03760) from City University of Hong Kong.
-
-- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.
+- 🔥🔥 [LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?](https://arxiv.org/abs/2510.09595) from University of Michigan.
 
-- 🔥 [BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software](https://arxiv.org/abs/2509.25248) from Arizona State University.
+- 🔥🔥 [Scaling Laws for Code: A More Data-Hungry Regime](https://arxiv.org/abs/2510.08702) from Harbin Institute of Technology.
 
-- 🔥 [Devstral: Fine-tuning Language Models for Coding Agent Applications](https://arxiv.org/abs/2509.25193) from Mistral AI.
+- 🔥🔥 [BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution](https://arxiv.org/abs/2510.08697) from Monash University.
 
-- 🔥 [LLaDA-MoE: A Sparse MoE Diffusion Language Model](https://arxiv.org/abs/2509.24389) from Ant Group.
-
-- 🔥 [Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents](https://arxiv.org/abs/2509.23045) from Moonshot AI.
+- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.
 
-- 🔥 [ML2B: Multi-Lingual ML Benchmark For AutoML](https://arxiv.org/abs/2509.22768) from HSE University.
+- 🔥🔥 [EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models](https://arxiv.org/abs/2510.03760) from City University of Hong Kong.
 
 🔥🔥     [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!
 
@@ -521,6 +517,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained
 
 27. **Mellum**: "Mellum: Production-Grade in-IDE Contextual Code Completion with Multi-File Project Understanding" [2025-10] [[paper](https://arxiv.org/abs/2510.05788)]
 
+28. "Scaling Laws for Code: A More Data-Hungry Regime" [2025-10] [[paper](https://arxiv.org/abs/2510.08702)]
+
 #### Encoder-Decoder
 
 1. **PyMT5** (Span Corruption): "PyMT5: multi-mode translation of natural language and Python code with transformers" [2020-10] [EMNLP 2020] [[paper](https://arxiv.org/abs/2010.03150)]
@@ -3603,6 +3601,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "When Names Disappear: Revealing What LLMs Actually Understand About Code" [2025-10] [[paper](https://arxiv.org/abs/2510.03178)]
 
+- "MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction" [2025-10] [[paper](https://arxiv.org/abs/2510.09049)]
+
 ### Software Modeling
 
 - "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12] [[paper](https://arxiv.org/abs/2212.03404)]
@@ -4391,6 +4391,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 
 - "CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks" [2025-07] [[paper](https://arxiv.org/abs/2507.10535)]
 
+- "BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution" [2025-10] [[paper](https://arxiv.org/abs/2510.08697)]
+
+- "How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective" [2025-10] [[paper](https://arxiv.org/abs/2510.08720)]
+
 #### Program Synthesis
 
 | Date | Venue | Benchmark | Size | Language | Source |
@@ -4485,6 +4489,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
 | 2025-08 | arXiv | FPBench | 1800 | Python | "Refining Critical Thinking in LLM Code Generation: A Faulty Premise-based Evaluation Framework" [[paper](https://arxiv.org/abs/2508.03622)] [[data](https://github.com/JialinLi13/FaultyPremise)] |
 | 2025-08 | arXiv | AutoCodeBench | 3,920 | 20 | "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators" [[paper](https://arxiv.org/abs/2508.09101)] [[data](https://autocodebench.github.io/)] |
 | 2025-08 | arXiv | AetherCode | 456 | C++ | "AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions" [[paper](https://arxiv.org/abs/2508.16402)] [[data](https://huggingface.co/datasets/m-a-p/AetherCode)] |
+| 2025-10 | arXiv | LiveOIBench | 403 | | "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)] [[data](https://liveoibench.github.io/)] |
 
 \* Automatically mined/human-annotated
 