# RL-LLM-NLP

This repository collects libraries and papers on Reinforcement Learning (RL) for Large Language Models (LLMs) and Natural Language Processing (NLP).

I consider RL a pivotal technology in AI, and NLP (particularly LLMs) a direction well worth exploring.

## Library

| GitHub | From | Year | Description |
| --- | --- | --- | --- |
| PRIME | PRIME-RL | 2025 | Scalable RL solution for the advanced reasoning of language models |
| rStar | Microsoft | 2025 | |
| veRL | ByteDance | 2024 | Volcano Engine Reinforcement Learning for LLM |
| trl | HuggingFace | 2024 | Train LMs with RL |
| RL4LMs | Allen | 2023 | RL library to fine-tune LMs to human preferences |
| alignment-handbook | HuggingFace | 2023 | Robust recipes to align language models with human and AI preferences |
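All of the libraries above build on policy-gradient training of a language model. As a self-contained illustration of the core idea they share, here is a minimal REINFORCE loop on a two-armed bandit (pure Python; the function name, hyperparameters, and baseline choice are illustrative, not taken from any of these libraries):

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_bandit(rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed Bernoulli bandit.

    Update rule per step: logits += lr * (r - baseline) * grad log pi(a),
    with a running-mean reward as the baseline (variance reduction).
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    baseline = 0.0
    for t in range(1, steps + 1):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1          # sample an action
        r = 1.0 if rng.random() < rewards[a] else 0.0    # Bernoulli reward
        baseline += (r - baseline) / t                   # running mean
        adv = r - baseline
        for i in range(len(logits)):
            # grad of log softmax w.r.t. logit i: 1{i==a} - probs[i]
            grad_log = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * adv * grad_log
    return softmax(logits)

probs = reinforce_bandit()  # policy shifts toward the higher-reward arm
```

In an LLM setting, the "action" is a sampled token sequence and the logits come from the model, but the gradient estimator is the same.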

## Paper

| Cate | Abbr | Title | From | Year | Link |
| --- | --- | --- | --- | --- | --- |
| RL | MRT | Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | Carnegie Mellon | 2025 | paper, GitHub |
| RL | L1, LCPO | L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning | Carnegie Mellon | 2025 | paper, GitHub |
| RL | Online-DPO-R1 | Online-DPO-R1: Unlocking Effective Reasoning Without the PPO Overhead | Salesforce AI Research | 2025 | paper, GitHub |
| RL | ORZ | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | StepFun | 2025 | paper, GitHub |
| RL | OREAL | Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning | InternLM | 2025 | paper, GitHub |
| RL | R1 | DeepSeek-R1 | DeepSeek | 2025 | paper, ① |
| o1 | Sky-T1 | Sky-T1: Train your own O1 preview model within $450 | NovaSky-AI | 2025 | GitHub |
| o1 | STILL | A series of technical reports on Slow Thinking with LLM | RUCAIBox | 2025 | GitHub |
| RL Scaling | LIMR | LIMR: Less is More for RL Scaling | GAIR-NLP | 2025 | paper, GitHub |
| RL Scaling | DeepScaleR | DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL | Agentica | 2025 | paper, GitHub |
| RL Scaling | ScalingLaw | Value-Based Deep RL Scales Predictably | Berkeley | 2025 | paper |
| SLM | PRIME | Process Reinforcement through Implicit Rewards | PRIME-RL | 2025 | paper, GitHub |
| SLM | rStar-Math | rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Microsoft | 2025 | paper, GitHub |
| SLM | rStar | rStar: Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers | Microsoft | 2024 | paper, GitHub |
| Unlearn | | A Closer Look at Machine Unlearning for Large Language Models | Sea AI | 2024 | paper, GitHub |
| Unlearn | Quark | Quark: Controllable Text Generation with Reinforced [Un]learning | Allen | 2022 | paper, GitHub |
| Align | ReMax | ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models | CUHK | 2024 | paper |
| Align | | A Comprehensive Survey of LLM Alignment Techniques: RLHF, RLAIF, PPO, DPO and More | Salesforce | 2024 | paper |
| Align | | Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback | Allen | 2024 | paper, GitHub |
| Align | | Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey | Capital One | 2024 | paper |
| Align | RLHF | Training language models to follow instructions with human feedback | OpenAI | 2022 | paper |
| Align | NLPO | Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization | Allen | 2022 | paper, GitHub |
| Align | | Fine-Tuning Language Models from Human Preferences | OpenAI | 2020 | paper, GitHub |
| Align | RLOO | Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs | Cohere | 2024 | paper |
| Policy | Dr. GRPO | Understanding R1-Zero-Like Training: A Critical Perspective | Sea AI Lab | 2025 | paper, GitHub |
| Policy | DAPO | DAPO: An Open-Source LLM Reinforcement Learning System at Scale | ByteDance Seed | 2025 | paper, GitHub |
| Policy | GRPO | DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models | DeepSeek | 2024 | paper |
| Policy | DPO | Direct Preference Optimization: Your Language Model is Secretly a Reward Model | Stanford | 2024 | paper |
| Policy | | Decision Transformer: Reinforcement Learning via Sequence Modeling | Berkeley | 2021 | paper, GitHub |
| Policy | PPO | Proximal Policy Optimization Algorithms | OpenAI | 2017 | paper |
| Policy | REINFORCE multi-sample | Buy 4 Reinforce Samples, Get a Baseline for Free! | University of Amsterdam | 2019 | paper |
| Policy | REINFORCE | Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning | Northeastern University | 1992 | paper |
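Several of the Policy-row objectives are compact enough to write down directly. Below is a hedged pure-Python sketch of the PPO clipped surrogate, the DPO loss, and GRPO's group-normalized advantage, following the formulas in the respective papers; the function names and default hyperparameters are mine, not from the papers:

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO (2017) clipped surrogate for one action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A), where r = pi_new / pi_old."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair:
    -log sigmoid(beta * [(logp_c - ref_c) - (logp_r - ref_r)])."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def grpo_advantages(rewards):
    """GRPO (DeepSeekMath) group-relative advantage: normalize each reward
    by the mean and std of its sampled group, with no learned value model."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]
```

For example, with a probability ratio of 2 and `eps=0.2`, the positive-advantage surrogate is capped at `1 + eps` times the advantage, while a negative advantage is left unclipped on the pessimistic side, which is exactly the asymmetry the PPO paper motivates.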
