Explaining Reinforcement Learning with Human Feedback (RLHF #1) (2023)

Reinforcement learning with human feedback is a new technique for training next-gen language models like ChatGPT. Instead of training LLMs merely to predict the next word, we train them to understand instructions and generate helpful responses.

Want to learn more about RLHF and how it works? Read on!


Imagine you want a large language model to generate a story about Harry Potter.

Large language models like GPT-3 were originally designed to predict the next token in a sequence. This means that to generate a Harry Potter story, you might prompt GPT-3 with the following input:

Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry [...]

GPT-3 then successfully responds with the following.


🧑 Harry Potter had been training his large language model AI for years to fight Voldemort. So when the fight began, and Voldemort pulled out his wand and started muttering incantations, Harry

🤖 was ready. His AI was ready.

In the beginning of the battle, the AI, which was creatively named the "Lemmatizer," used its powerful language model to predict the most likely next word in Voldemort's incantation. Then it took the word and used it to lookup the most likely previous word in its internal dictionary, which it then inserted into the middle of the incantation…

However, this way of “commanding” an AI – by asking it to “continue” a prompt – is unnatural. Ideally, telling the LLM “Generate a story about Harry Potter using AI to fight Voldemort” would suffice, but this is what happens when you send it to a non-RLHF model:


🧑 Generate a story about Harry Potter using AI to fight Voldemort

🤖 Imagine that you will write a novel about Harry Potter and you will use artificial intelligence for the first time to generate the story. You will use some basic techniques.

First, you will extract some ideas from the original book, Harry Potter and the Philosopher's Stone. You will use the basic information about the characters, the places and the events that appear in the book.


Despite knowing how to write stories, the LLM doesn’t understand the request!

This makes sense: in the Internet scrapes LLMs are trained on, a command like “Generate a story” is rarely followed by an actual story; it’s more likely to appear as a bullet in someone’s to-do list. Even the most powerful next-word predictor simply isn’t the same as a model that intelligently follows instructions!

Enter reinforcement learning with human feedback.

Examples of RLHF vs. non-RLHF

To drive home the difference between non-RLHF large language models (not trained on human feedback; essentially “next token” predictors) and RLHF large language models (trained on human feedback; designed to satisfy your requests, not merely predict your next word), let’s look at a few more examples.

Example #1: Writing an email

Prompt: Write an email to my mother wishing her Merry Christmas, and ask how dad is doing.

Non-RLHF output

This is what a non-RLHF model (the original davinci) produces:


🤖 Read the book I got for Christmas: The Screwtape Letters by C. S. Lewis

Finish my Christmas shopping

Watch the Grinch Who Stole Christmas (good one)

Finish Christmas cards.

Go to church

Notice that the non-RLHF model interprets the command as a bullet in a list of todos, not as an instruction!

In contrast, when training RLHF models, Surgers (our data labelers and AI teachers) provide feedback to the LLM that this isn’t a good response.

RLHF output

Here’s what you get when you provide the same prompt to an RLHF model trained on that human feedback.



🤖 Hi Mom,

Merry Christmas! I hope you're doing well. I'm wondering how Dad is doing. I hope he's doing well. I hope to talk to you soon.

Take care!

This time, the LLM succeeds!

If we want to get the non-RLHF model to work, we need to contort the prompt, like so:

[Screenshot: a reworked prompt that coaxes the non-RLHF model into writing the email, along with its output]

But even then, it doesn’t know when to stop correctly – this reads like what you’d find on a random Internet webpage about things to do on Christmas.

Example #2: Solving math problems

Prompt: What is 2 + 2?

Non-RLHF output


🤖 What is 5 + 1?”

The result was the same. He couldn’t remember the answer to either question.

Not only does the non-RLHF model fail to answer the question, it treats the question as part of a story’s dialogue!

RLHF output


The RLHF model correctly interprets the prompt as a question to answer, and responds with 4.


(Sidenote: to train RLHF models to perform math, we build teams of Surgers with advanced mathematical backgrounds – like math PhDs to help LLMs learn to conduct mathematical research, and competitive mathletes to train them to solve IMO problems – and we build data generation tools to help them, like embedding-based assistants to aid diversity and verifiers to check correctness.)

Example #3: Generating code


🧑 Help me write a Python function that outputs whether a number is even.

Non-RLHF output


🤖 My code so far

def isEven(n):
    if n > 0:
        if n % 2 == 0:
            return True
        else:
            return False

I know this will not work.

Here's the full code:

Note that the non-RLHF model interprets the prompt not as a command, but rather as the beginning of a StackOverflow post!

Even though it essentially gets the code correct – it “knows” how to write code, in some latent, buried sense – it surrounds the code with extraneous StackOverflow-like text.

Unearthing those latent abilities is part of the magic of RLHF, and why we do a lot of advanced, Copilot-style RLHF training for customers as well.

RLHF output

[Screenshot: the RLHF model’s response, a working Python function]
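
Here’s a minimal sketch of the kind of direct answer an RLHF model gives for this prompt (my own illustration, not a transcription of the screenshot):

def is_even(n: int) -> bool:
    # Return True when n is divisible by 2, False otherwise.
    return n % 2 == 0

print(is_even(4))  # True
print(is_even(7))  # False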

How does RLHF work?

So how do you train an RLHF model on human feedback? We’ll dive into the technical details in a follow-up post, but in the meantime, here’s a high-level overview of the steps:

Step #1: Unsupervised pre-training

Start with a pre-trained language model (a next-token predictor), like the original GPT-3.
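
To make “next-token predictor” concrete, here’s a minimal sketch using the Hugging Face transformers library, with the openly available GPT-2 standing in for GPT-3 (an assumption for illustration; the exact model doesn’t matter):

# A pre-trained LM simply continues text, one likely token at a time.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Harry Potter had been training his large language model AI for years"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: repeatedly append the highest-probability next token.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))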


Step #2: Supervised finetuning

Form a set of commands (e.g., “generate a story about Harry Potter”), and a human-written response to each command. In other words, form a training dataset of <prompt, ideal generation> pairs.


This data collection and data generation is what LLM companies use our Surge AI platform for!

Then finetune the pre-trained model to output these human responses.
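
As a rough sketch of what this finetuning step looks like (assuming a PyTorch / Hugging Face stack, with GPT-2 as a stand-in and a toy hand-written pair; real pipelines mask the prompt tokens out of the loss and train on far more data):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A toy <prompt, ideal generation> pair (hypothetical example).
sft_pairs = [
    ("Write an email to my mother wishing her Merry Christmas.",
     "Hi Mom,\n\nMerry Christmas! I hope you're doing well. ..."),
]

model.train()
for prompt, ideal in sft_pairs:
    # Standard next-token cross-entropy on the human-written response:
    # the model learns to reproduce the ideal generation given the prompt.
    enc = tokenizer(prompt + "\n\n" + ideal, return_tensors="pt")
    loss = model(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()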


Step #3: Training a “human feedback” reward model

The third step is building a reward model that scores how good an LLM output is for a given prompt. In other words, a reward model is another model (e.g., another LLM with its final layers chopped off) that takes a prompt and generation as input, and outputs a scalar reward.

So once again, form a set of new commands. Then form a set of machine-generated responses to these commands, and ask Surgers to score or rank their quality.

Use this dataset to train a reward model that outputs a quality score for any <prompt, generation> pair.
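
A minimal sketch of such a reward model and its training loss (my own illustration of the standard pairwise-ranking approach; GPT-2 again stands in for the backbone):

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

class RewardModel(nn.Module):
    def __init__(self, backbone_name="gpt2"):
        super().__init__()
        # An LM backbone with its language-modeling head removed...
        self.backbone = AutoModel.from_pretrained(backbone_name)
        # ...replaced by a head that outputs a single scalar reward.
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.backbone(input_ids, attention_mask=attention_mask).last_hidden_state
        # Score the whole <prompt, generation> sequence from its final token.
        return self.value_head(hidden[:, -1, :]).squeeze(-1)

def ranking_loss(score_chosen, score_rejected):
    # Push the human-preferred generation's score above the rejected one's.
    return -torch.nn.functional.logsigmoid(score_chosen - score_rejected).mean()

reward_model = RewardModel()
ids = tokenizer("What is 2 + 2?\n\n4", return_tensors="pt").input_ids
print(reward_model(ids))  # a scalar quality score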


Step #4: Train a Reinforcement Learning policy that optimizes based on the reward model

Finally, train a Reinforcement Learning policy (a policy, in this case, is essentially an algorithm that outputs the next word or token) that optimizes based on the reward model (i.e., tries to generate text that the reward model thinks humans prefer). This is our new RLHF algorithm!

In other words, in reinforcement learning, a policy takes actions from a given state and receives rewards that it tries to maximize. In this case, our policy is essentially just a language model, its actions are generating text (choosing the next tokens), and its reward is the score that the reward model gives the generated text.

So to train this RLHF model:

  • First, initialize the RL policy to be the finetuned LLM from Step 2. Then repeat the following steps (sketched in code after this list):
  • Take a prompt and use the RL policy to generate an output.
  • Use the reward model to calculate a reward for this generation. (This is essentially simulating how a human would score the generation.)
  • Update the RL policy based on the reward (i.e., the policy is now learning whether it is generating good or bad responses).
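
Here’s a highly simplified sketch of that loop, using the reward model sketched above (a naive REINFORCE-style update for illustration; production systems like InstructGPT use PPO plus a KL penalty against the Step 2 model, which this omits):

import torch

def rlhf_step(policy, reward_model, tokenizer, prompt, optimizer):
    # 1. Take a prompt and use the RL policy (the finetuned LLM) to generate.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    generation = policy.generate(prompt_ids, max_new_tokens=50, do_sample=True)

    # 2. Score the generation with the reward model
    #    (a stand-in for how a human would rate it).
    reward = reward_model(generation).detach()

    # 3. Update the policy: make tokens in high-reward generations more likely.
    logits = policy(generation).logits[:, :-1, :]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_log_probs = log_probs.gather(
        -1, generation[:, 1:].unsqueeze(-1)).squeeze(-1)
    loss = -(reward * token_log_probs.sum()).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()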


Et voilà! You now have a state-of-the-art large language model like ChatGPT.


Of course, there are many nuances to getting these steps right. For example:

  • When performing the supervised finetuning and reward modeling steps, what set of prompts do you want to use? How do you create them in a diverse way? For example, if you want to create a good summarizer, what kinds of texts are you summarizing, how long are the summaries, and how do you phrase the summarization requests to capture real-world needs?
  • When training a reward model, what set of generations do you give human evaluators, and how do they score them?

These questions are exactly what we build RLHF tooling to help our customers with. If you’re training your own RLHF models and want help, reach out to team@surgehq.ai!


In short, RLHF is the new state-of-the-art for training language models that align to human preferences and give us what we want. OpenAI even found that RLHF models are vastly more efficient: their 1.3B-parameter RLHF model outperformed their 175B-parameter non-RLHF model, despite having more than 100x fewer parameters!

We’re lucky to be at the forefront of this movement, helping LLM companies around the world use human feedback to:

  • Train LLMs to use tools – like search engines, IDEs, and spreadsheets – via human demonstrations
  • Train them to code
  • Train them to solve advanced math and science problems
  • …and more.

As Nathan Lambert, a research scientist at Hugging Face writes:

[Screenshot of Nathan Lambert’s quote]

In our next post, we’ll dive into the technical details, nuances, and open questions surrounding RLHF. To follow along with this RLHF series, follow us on Twitter @HelloSurgeAI or on LinkedIn!

Need RLHF datasets to train the next generation of AI? We work with top AI companies to provide the rich, high-quality human feedback they need to build state-of-the-art language models. Reach out to team@surgehq.ai to learn more!

