Hi maintainers,
I would like to contribute FlowRL, a new RL algorithm for LLM reasoning that uses distribution matching instead of reward maximization.
Key idea
- Uses distribution matching (via flow balance) rather than reward maximization
- Achieves better generation diversity by avoiding collapse onto a single high-reward mode
- Improves policy generalization
- Could be extended to handle multiple, diverse reward functions in the future
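
To make the contrast above concrete, here is a minimal toy sketch (not the FlowRL implementation). It assumes a fixed set of candidate responses with scalar rewards, a target distribution proportional to `exp(beta * reward)`, and a generic trajectory-balance-style squared residual `(log Z + log pi(y) - beta * r(y))^2`. The function names, `beta`, and the toy rewards are all hypothetical; the actual FlowRL objective is described in the paper linked below and is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_policy(rewards):
    """Limit of pure reward maximization: all mass on the best candidate."""
    best = max(range(len(rewards)), key=lambda i: rewards[i])
    return [1.0 if i == best else 0.0 for i in range(len(rewards))]


def target_distribution(rewards, beta=1.0):
    """Assumed matching target: p*(y) proportional to exp(beta * r(y))."""
    return softmax([beta * r for r in rewards])


def fit_flow_balance(rewards, beta=1.0, lr=0.1, steps=2000):
    """Fit policy logits and a learnable log Z by gradient descent on the
    mean squared residual (log Z + log pi(y) - beta * r(y))^2, averaged
    uniformly over all candidates (a toy, full-support setting)."""
    k = len(rewards)
    logits = [0.0] * k
    log_z = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        residuals = [log_z + math.log(p) - beta * r
                     for p, r in zip(probs, rewards)]
        total = sum(residuals)
        # d(loss)/d(log Z) = (2/k) * sum of residuals
        grad_log_z = 2.0 * total / k
        # d(log pi(y))/d(logit_j) = 1[y == j] - pi(j), so
        # d(loss)/d(logit_j) = (2/k) * (residual_j - pi(j) * sum of residuals)
        grad_logits = [2.0 * (residuals[j] - probs[j] * total) / k
                       for j in range(k)]
        log_z -= lr * grad_log_z
        logits = [l - lr * g for l, g in zip(logits, grad_logits)]
    return softmax(logits), log_z


if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]  # hypothetical rewards for three candidates
    learned, log_z = fit_flow_balance(rewards)
    print("reward-maximizing policy:", greedy_policy(rewards))
    print("matched policy:          ", [round(p, 4) for p in learned])
    print("target distribution:     ", [round(p, 4) for p in target_distribution(rewards)])
```

In this toy setting the residual is zero exactly when the policy equals the target distribution and `log Z` equals the log-partition function, so lower-reward candidates keep non-zero probability instead of being driven to zero as in the greedy policy.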
Algorithm
References
- 🤗 Hugging Face paper: https://huggingface.co/papers/2509.15207
- 🔧 veRL official PR: [recipe] feat: add FlowRL recipe volcengine/verl#3924
- 💻 Source code: https://github.com/Xuekai-Zhu/FlowRL
Would this be a good fit for this repository? Happy to discuss implementation details!
Thanks!