whimsicalism on Dec 7, 2024, on: OpenAI Reinforcement Fine-Tuning Research Program

Online RL for LLMs means you are sampling from the model, scoring immediately, and passing gradients back to the model, as opposed to sampling from the model a bunch, getting scores offline, and then fine-tuning the model on those offline-scored generations.
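To make the distinction concrete, here is a toy sketch, not how any real LLM trainer is implemented. Everything in it is an assumption for illustration: the "model" is a three-way softmax policy, `REWARD` is a made-up scorer, the online update is plain REINFORCE, and the offline variant uses reward-weighted log-likelihood as a stand-in for "fine-tuning on scored generations".

```python
import math
import random

# Toy stand-ins (hypothetical): each "generation" is one of three actions,
# and REWARD plays the role of the scorer.
ACTIONS = ["a", "b", "c"]
REWARD = {"a": 0.0, "b": 0.2, "c": 1.0}


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sample(logits, rng):
    # Draw an action index from the policy defined by `logits`.
    return rng.choices(range(len(logits)), weights=softmax(logits))[0]


def grad_log_prob(logits, action):
    # Gradient of log softmax(logits)[action] w.r.t. logits: onehot - probs.
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def online_rl(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * len(ACTIONS)
    for _ in range(steps):
        a = sample(logits, rng)      # sample from the CURRENT policy
        r = REWARD[ACTIONS[a]]       # score immediately
        g = grad_log_prob(logits, a)
        # REINFORCE step: the very next sample comes from the updated policy.
        logits = [w + lr * r * gi for w, gi in zip(logits, g)]
    return softmax(logits)


def offline_finetune(n_samples=2000, epochs=1, lr=0.1, seed=0):
    rng = random.Random(seed)
    init_logits = [0.0] * len(ACTIONS)
    # Phase 1: sample a bunch from the frozen initial policy and score it.
    data = []
    for _ in range(n_samples):
        a = sample(init_logits, rng)
        data.append((a, REWARD[ACTIONS[a]]))
    # Phase 2: fine-tune on the fixed, already-scored dataset.
    logits = list(init_logits)
    for _ in range(epochs):
        for a, r in data:
            g = grad_log_prob(logits, a)
            logits = [w + lr * r * gi for w, gi in zip(logits, g)]
    return softmax(logits)


if __name__ == "__main__":
    print("online :", online_rl())
    print("offline:", offline_finetune())
```

The only structural difference is where the samples come from: in `online_rl` each sample is drawn from the policy as it is being updated, while in `offline_finetune` all samples come from the initial policy before any training happens.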