This repository collects libraries and papers on Reinforcement Learning (RL) for Large Language Models (LLMs) and Natural Language Processing (NLP).
I consider RL a pivotal technology in AI, and NLP (particularly LLMs) a direction well worth exploring.
GitHub | From | Year | Description |
---|---|---|---|
PRIME | PRIME-RL | 2025 | Scalable RL solution for the advanced reasoning of language models |
rStar | Microsoft | 2025 | |
veRL | Bytedance | 2024 | Volcano Engine Reinforcement Learning for LLM |
trl | HuggingFace | 2024 | Train LM with RL |
RL4LMs | Allen | 2023 | RL library to fine-tune LM to human preferences |
alignment-handbook | HuggingFace | 2023 | Robust recipes to align language models with human and AI preferences |
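
As a quick illustration of what these libraries look like in practice, below is a rough sketch of preference tuning with HuggingFace trl's `DPOTrainer`. The model and dataset names are placeholders, and the exact keyword arguments (for example `processing_class` vs. `tokenizer`) differ across trl versions, so treat this as an assumption-laden sketch rather than a pinned recipe.

```python
# Hedged sketch: DPO fine-tuning with HuggingFace trl.
# Model/dataset names are placeholders; keyword arguments differ across trl versions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"              # any small chat model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# A preference dataset with "chosen" / "rejected" response pairs.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(
    output_dir="qwen2-dpo",
    per_device_train_batch_size=2,
    beta=0.1,                                        # strength of the implicit KL penalty
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,                      # older trl releases take tokenizer=...
)
trainer.train()
```
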
Category | Abbr | Title | From | Year | Link |
---|---|---|---|---|---|
RL | MRT | Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | Carnegie Mellon | 2025 | paper, GitHub |
RL | L1, LCPO | L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning | Carnegie Mellon | 2025 | paper, GitHub |
RL | Online-DPO-R1 | Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead | Salesforce AI Research | 2025 | paper, GitHub |
RL | orz | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | StepFun | 2025 | paper, GitHub |
RL | OREAL | Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | InternLM | 2025 | paper, GitHub |
RL | R1 | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | DeepSeek | 2025 | paper, ①
o1 | Sky-T1 | Sky-T1: Train your own O1 preview model within $450 | NovaSky-AI | 2025 | GitHub |
o1 | STILL | A series of technical report on Slow Thinking with LLM | RUCAIBox | 2025 | GitHub |
RL Scaling | LIMR | LIMR: Less is More for RL Scaling | GAIR-NLP | 2025 | paper, GitHub |
RL Scaling | DeepScaleR | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica | 2025 | paper, GitHub |
RL Scaling | ScalingLaw | Value-Based Deep RL Scales Predictably | Berkeley | 2025 | paper |
SLM | PRIME | Process Reinforcement through Implicit Rewards | PRIME-RL | 2025 | paper, GitHub |
SLM | rStar-Math | rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Microsoft | 2025 | paper, GitHub
SLM | rStar | rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Microsoft | 2024 | paper, GitHub
Unlearn | | A Closer Look at Machine Unlearning for Large Language Models | Sea AI | 2024 | paper, GitHub
Unlearn | Quark | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen | 2022 | paper, GitHub |
Align | ReMax | ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models | CUHK | 2024 | paper |
Align | | A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More | Salesforce | 2024 | paper
Align | | Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | Allen | 2024 | paper, GitHub
Align | | Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey | Capital One | 2024 | paper
Align | RLHF | Training language models to follow instructions with human feedback | OpenAI | 2022 | paper |
Align | NLPO | Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization | Allen | 2022 | paper, GitHub |
Align | | Fine-Tuning Language Models from Human Preferences | OpenAI | 2020 | paper, GitHub
Align | RLOO | Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs | Cohere | 2024 | paper |
Policy | Dr. GRPO | Understanding R1-Zero-Like Training: A Critical Perspective | Sea AI Lab | 2025 | paper, GitHub
Policy | DAPO | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | ByteDance Seed | 2025 | paper, GitHub |
Policy | GRPO | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | DeepSeek | 2024 | paper |
Policy | DPO | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Stanford | 2023 | paper
Policy | | Decision Transformer: Reinforcement Learning via Sequence Modeling | Berkeley | 2021 | paper, GitHub
Policy | PPO | Proximal Policy Optimization Algorithms | OpenAI | 2017 | paper |
Policy | REINFORCE multi-sample | Buy 4 REINFORCE Samples, Get a Baseline for Free! | University of Amsterdam | 2019 | paper
Policy | REINFORCE | Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning | Northeastern University | 1992 | paper |
- ① Reproductions of DeepSeek-R1:
- Jiayi-Pan/TinyZero: Clean, minimal, accessible reproduction of DeepSeek R1-Zero
- huggingface/open-r1: Fully open reproduction of DeepSeek-R1
- hkust-nlp/simpleRL-reason: A replication of DeepSeek-R1-Zero and DeepSeek-R1 training on small models with limited data
- ZihanWang314/RAGEN: RAGEN is the first open-source reproduction of DeepSeek-R1 on AGENT training.
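
To make the Policy rows above concrete, here is a minimal, self-contained sketch of the REINFORCE objective, the PPO clipped surrogate, and GRPO-style group-normalized advantages. All tensors are random toy placeholders (shapes, rewards, and log-probabilities are made up); only the objective math follows the cited papers, and this is not tied to any particular training stack.

```python
# Toy illustration of REINFORCE, PPO-clip, and GRPO-style advantages.
# All tensors are random placeholders; only the objective math is the point.
import torch

torch.manual_seed(0)
G, T = 4, 8                                              # G sampled responses, T tokens each
logp_new = torch.randn(G, T, requires_grad=True)         # log-probs under the current policy
logp_old = logp_new.detach() + 0.1 * torch.randn(G, T)   # log-probs at sampling time
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])             # one scalar reward per response

# REINFORCE (Williams, 1992): maximize E[R * log pi], shown here without a baseline.
reinforce_loss = -(rewards[:, None] * logp_new).mean()

# GRPO-style advantage: normalize each response's reward within its sampled group.
adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
adv = adv[:, None].expand(G, T)                          # broadcast to every token

# PPO clipped surrogate (Schulman et al., 2017), also used by GRPO-style trainers.
eps = 0.2
ratio = torch.exp(logp_new - logp_old)                   # importance ratio pi_new / pi_old
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
ppo_loss = -torch.min(ratio * adv, clipped * adv).mean()

print(f"REINFORCE loss: {reinforce_loss.item():.4f}")
print(f"PPO-clip loss:  {ppo_loss.item():.4f}")
```
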