README.md: 15 additions & 10 deletions
@@ -15,21 +15,17 @@ This is the repo for our [TMLR](https://jmlr.org/tmlr/) survey [Unifying the Per
## News

🔥🔥🔥 [2025/10/13] Featured papers:

- 🔥🔥 [LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?](https://arxiv.org/abs/2510.09595) from University of Michigan.

- 🔥🔥 [Scaling Laws for Code: A More Data-Hungry Regime](https://arxiv.org/abs/2510.08702) from Harbin Institute of Technology.

- 🔥🔥 [BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution](https://arxiv.org/abs/2510.08697) from Monash University.

- 🔥 [CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects](https://arxiv.org/abs/2509.14856) from Ant Group.

- 🔥🔥 [EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models](https://arxiv.org/abs/2510.03760) from City University of Hong Kong.

🔥🔥 [2025/08/24] 29 papers from ICML 2025 have been added. Search for the keyword "ICML 2025"!
@@ -521,6 +517,8 @@ These models are Transformer encoders, decoders, and encoder-decoders pretrained
28. "Scaling Laws for Code: A More Data-Hungry Regime" [2025-10][[paper](https://arxiv.org/abs/2510.08702)]

#### Encoder-Decoder

1. **PyMT5** (Span Corruption): "PyMT5: multi-mode translation of natural language and Python code with transformers" [2020-10][EMNLP 2020][[paper](https://arxiv.org/abs/2010.03150)]
@@ -3603,6 +3601,8 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
- "When Names Disappear: Revealing What LLMs Actually Understand About Code" [2025-10][[paper](https://arxiv.org/abs/2510.03178)]

- "MEC3O: Multi-Expert Consensus for Code Time Complexity Prediction" [2025-10][[paper](https://arxiv.org/abs/2510.09049)]

### Software Modeling

- "Towards using Few-Shot Prompt Learning for Automating Model Completion" [2022-12][[paper](https://arxiv.org/abs/2212.03404)]
@@ -4391,6 +4391,10 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
- "CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks" [2025-07][[paper](https://arxiv.org/abs/2507.10535)]

- "BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution" [2025-10][[paper](https://arxiv.org/abs/2510.08697)]

- "How Many Code and Test Cases Are Enough? Evaluating Test Cases Generation from a Binary-Matrix Perspective" [2025-10][[paper](https://arxiv.org/abs/2510.08720)]

#### Program Synthesis

| Date | Venue | Benchmark | Size | Language | Source |
@@ -4485,6 +4489,7 @@ For each task, the first column contains non-neural methods (e.g. n-gram, TF-IDF
| 2025-08 | arXiv | AutoCodeBench | 3,920 | 20 | "AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators" [[paper](https://arxiv.org/abs/2508.09101)][[data](https://autocodebench.github.io/)]|
| 2025-08 | arXiv | AetherCode | 456 | C++ | "AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions" [[paper](https://arxiv.org/abs/2508.16402)][[data](https://huggingface.co/datasets/m-a-p/AetherCode)]|
| 2025-10 | arXiv | LiveOIBench | 403 || "LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?" [[paper](https://arxiv.org/abs/2510.09595)][[data](https://liveoibench.github.io/)]|