Can I contribute FlowRL - a new RL algorithm for LLM reasoning?

Hi maintainers,

I would like to contribute **FlowRL**, a new RL algorithm for LLM reasoning that uses **distribution matching** instead of **reward maximization**.

### Key idea
- Uses distribution matching (via flow balance) rather than reward maximization
- Achieves better generation diversity by avoiding single-peak convergence
- Improves policy generalization
- Could be extended to handle multiple, diverse reward functions in the future

### Algorithm

$$
\mathcal{L}_{\text{FlowRL}} = w \cdot \left( \log Z_{\phi}(x) + \frac{1}{|y|} \log \pi_{\theta}(y \mid x) - \beta \hat{r}(x, y) - \frac{1}{|y|} \log \pi_{\text{ref}}(y \mid x) \right)^2
$$
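
To make the objective concrete for discussion, here is a minimal per-sample sketch in plain Python. It is not the implementation from the linked PR or source repo. The function name `flowrl_loss` and its argument names are hypothetical. It assumes that `log π(y | x)` is the sum of per-token log-probabilities (so dividing by `|y|` gives a per-token mean), and that `log_z`, `reward`, `beta`, and `w` are passed in as plain numbers. How `w` and `log Z_φ(x)` are produced is outside this sketch.

```python
from typing import Sequence


def flowrl_loss(
    log_z: float,
    logp_theta_tokens: Sequence[float],
    logp_ref_tokens: Sequence[float],
    reward: float,
    beta: float = 1.0,
    w: float = 1.0,
) -> float:
    """Per-sample squared residual from the formula above (illustrative only).

    log_z             -- log Z_phi(x), supplied as a number (assumption)
    logp_theta_tokens -- per-token log-probs of y under pi_theta
    logp_ref_tokens   -- per-token log-probs of y under pi_ref
    reward            -- r_hat(x, y)
    beta, w           -- the scalars beta and w from the formula
    """
    n = len(logp_theta_tokens)
    if n == 0 or n != len(logp_ref_tokens):
        raise ValueError("both log-prob sequences must be non-empty and equal length")

    # (1/|y|) log pi(y|x): mean of the per-token log-probs
    mean_logp_theta = sum(logp_theta_tokens) / n
    mean_logp_ref = sum(logp_ref_tokens) / n

    # Term inside the square, in the same order as the formula
    residual = log_z + mean_logp_theta - beta * reward - mean_logp_ref
    return w * residual ** 2


# Example: residual = 2.0 + (-1.0) - 1.0 * 1.0 - (-2.0) = 2.0, so loss = 2.0 * 2.0**2
loss = flowrl_loss(2.0, [-1.0, -1.0], [-2.0, -2.0], reward=1.0, beta=1.0, w=2.0)
```

The loss is zero exactly when the length-normalized log-probability under `π_θ` equals the length-normalized log-probability under `π_ref` plus `β r̂(x, y) − log Z_φ(x)`, which follows directly from setting the squared term to zero.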

### References
- 🤗 Hugging Face paper: https://huggingface.co/papers/2509.15207
- 🔧 veRL official PR: https://github.com/volcengine/verl/pull/3924
- 💻 Source code: https://github.com/Xuekai-Zhu/FlowRL

<img width="1504" height="548" alt="Image" src="https://github.com/user-attachments/assets/8f9e450a-dc22-4d2a-9ce5-d920bb4685aa" />

Would this be a good fit for this repository? Happy to discuss implementation details!

Thanks!
