State of GPT - Andrej Karpathy MS Build

May 24, 2023   

Notes from State of GPT by Andrej Karpathy given at MS Build 2023.

Video Link

The talk has two parts: first, how we train GPT assistants, and second, how to use them in your applications.

Training:

  • Four stages: Pretraining, Supervised Finetuning, Reward Modeling, Reinforcement Learning
  • The stages are executed serially
  • Each stage has a dataset that powers the stage
  • Each stage has an algorithm (the objective/loss function for that stage)
  • And each stage produces a resulting model

The pretraining stage is the most difficult. It accounts for ~99% of the training compute time and FLOPs. It uses internet-scale datasets and can take months.

The other stages are finetuning stages that only take days.

Base Model:

First, data collection for pretraining (example sources):

  • CommonCrawl, C4, GitHub, Wikipedia, Books, Stack Exchange, etc.

After gathering the raw text we need to tokenize it, mapping chunks of text to integer tokens. Typical vocabulary sizes are 10K-100K tokens.
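
As a concrete illustration (mine, not from the talk), here is roughly what tokenization looks like using OpenAI's tiktoken library; the exact token IDs depend on which encoding you pick.

    import tiktoken  # OpenAI's BPE tokenizer library

    # GPT-2/GPT-3 style byte-pair encoding with a ~50K token vocabulary
    enc = tiktoken.get_encoding("gpt2")

    ids = enc.encode("Tokenization maps text to integers.")
    print(ids)              # a list of integer token IDs
    print(enc.decode(ids))  # round-trips back to the original text
    print(enc.n_vocab)      # vocabulary size (50257 for the GPT-2 encoding)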

Two example models:

  • GPT-3 - 50K token vocab, 2048 context length, 175B parameters, trained on 300B tokens
  • LLaMA - 32K token vocab, 2048 context length, 65B parameters, trained on 1-1.4T tokens

Do not judge the power of a model by its parameter count alone. LLaMA is more powerful than GPT-3 because it was trained on far more tokens.

  • GPT-3 trained for roughly a month on 1K-10K V100 GPUs, costing roughly $1-10M
  • LLaMA (65B) took roughly 21 days to train on 2048 A100 GPUs, costing about $5M

Pretraining

  • Inputs into the Transformer are arrays of shape (B,T)
    • B is the batch size (number of rows)
    • T is the context length (number of columns)

The (B, T) matrix is all the documents joined together with end-of-text markers, so:

 1, 2, 3, 4
 7, 1, 4, 3
 3, 2, 9, 4

In this toy example, assume the token “4” is the end-of-text marker.

The neural network works through each cell, trying to predict the next token from the tokens before it.
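
A minimal sketch (not from the talk) of how a (B, T) batch and the next-token objective fit together, using PyTorch and made-up token IDs; a real model would produce the logits instead of the random stand-in used here.

    import torch
    import torch.nn.functional as F

    # A toy token stream: documents concatenated with an end-of-text marker
    # between them (token id 4, matching the toy example above).
    stream = torch.tensor([1, 2, 3, 4, 7, 1, 4, 3, 3, 2, 9, 4, 5, 8, 6, 4, 0, 2])

    B, T = 3, 4  # batch size (rows) and context length (columns)

    # Slice the stream into (B, T) inputs and shifted targets: the target at
    # position t is simply the token at position t + 1.
    ix = torch.arange(B) * T
    x = torch.stack([stream[i:i + T] for i in ix])          # (B, T) inputs
    y = torch.stack([stream[i + 1:i + T + 1] for i in ix])  # (B, T) targets

    # A real transformer would map x to logits of shape (B, T, vocab_size);
    # random logits stand in for the model here.
    vocab_size = 10
    logits = torch.randn(B, T, vocab_size)

    # The pretraining objective: cross-entropy loss on the next token at every cell.
    loss = F.cross_entropy(logits.view(B * T, vocab_size), y.reshape(B * T))
    print(loss.item())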

A sample of a small GPT trained on Tiny Shakespeare by the New York Times is shared. After 30K iterations the output looks like Shakespeare.

After roughly a month of this kind of training, the model has learned very general representations that are easy to adapt to downstream tasks.

For example, large labeled datasets for sentiment analysis can be replaced by an LLM with little task-specific data.

Base models can be prompted into completing tasks.

The field moved from finetuning toward prompting.

There’s a huge family of base models.

Base models are not assistants. You have to “trick” them with a proper prompt. So we do supervised finetuning to make the model behave more like an assistant.

Supervised Finetuning

Have human contractors gather data of the format “prompt, ideal response”.

Around 10K prompt/ideal-response pairs, each containing a lot of carefully written text.

SFT (Supervised Fine Tuning) Model - Can be deployed
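
One common way (an assumption on my part, not spelled out in the talk) to turn a prompt/ideal-response pair into a training example is to concatenate the two and mask the loss so the model is only graded on the response tokens:

    import tiktoken

    enc = tiktoken.get_encoding("gpt2")

    # A hypothetical SFT record written by a human contractor.
    example = {
        "prompt": "Write a short thank-you note to a colleague.",
        "response": "Thanks so much for your help on the launch last week!",
    }

    prompt_ids = enc.encode(example["prompt"])
    response_ids = enc.encode(example["response"])

    # One training sequence; labels of -100 are ignored by PyTorch's
    # cross-entropy loss, so only the response tokens contribute.
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids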

Reward Modeling

Have the model generate multiple responses to a prompt. Humans score the responses.

Lay each prompt + completion out as a row of a (B, T) matrix, have the model predict a scalar reward at the end of each completion, and train those rewards to be consistent with the human rankings.

RM Model is not deployable.
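
The talk doesn't give the exact objective; a common formulation for learning from human comparisons is a pairwise ranking loss, sketched here with placeholder reward values:

    import torch
    import torch.nn.functional as F

    # Suppose the reward model has produced scalar rewards for two completions
    # of the same prompt, and the human labeler preferred the first one.
    reward_chosen = torch.tensor([1.3])
    reward_rejected = torch.tensor([-0.2])

    # Bradley-Terry style pairwise loss: push the reward of the preferred
    # completion above the reward of the rejected one.
    loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
    print(loss.item())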

Reinforcement Learning

Again lay prompts and sampled completions out in a (B, T) matrix. The (now frozen) reward model scores each completion, and the language-modeling loss on the completion tokens is weighted by that score, so highly rewarded completions become more likely in the future.

Prompt 1, Completion 1, Reward score
Prompt 1, Completion 2, Reward score
Prompt 1, Completion 3, Reward score
Prompt 2, Completion 1, Reward score
Prompt 2, Completion 2, Reward score
...

RL Model is deployable.
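
A very rough sketch of the intuition (the production systems use PPO with extra terms such as a KL penalty; this is not the actual objective): weight the usual next-token loss on the completion tokens by the reward model's score.

    import torch
    import torch.nn.functional as F

    B, T, vocab_size = 2, 4, 10
    logits = torch.randn(B, T, vocab_size)             # stand-in for the model's output
    completion = torch.randint(0, vocab_size, (B, T))  # sampled completion tokens
    rewards = torch.tensor([0.8, -0.3])                # one scalar reward per completion

    # Per-token next-token loss on the completion tokens.
    token_loss = F.cross_entropy(
        logits.view(B * T, vocab_size), completion.view(B * T), reduction="none"
    ).view(B, T)

    # Weighting by the reward: positively rewarded completions are reinforced,
    # negatively rewarded ones are pushed down.
    loss = (rewards[:, None] * token_loss).mean()
    print(loss.item())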

Examples of different types of models

  • Base Models: GPT, LLaMA, PaLM
  • Supervised Finetuning: Vicuna-13B
  • Reward Modeling (No Examples - model is not deployable)
  • Reinforcement Learning: ChatGPT, Claude

ChatGPT is an example of RLHF - (Reinforcement Learning from Human Feedback) Model

Why do RLHF?

  • It works better - according to comparison scoring by humans
  • Why does it work better? We don’t currently know for sure.
    Andrej thinks it’s related to how much easier it is, computationally, to compare than to generate. For example, if a human contractor is asked to write a haiku about paper clips, that is quite hard; but if they are asked to rank three haikus about paper clips, they will do a much better job. So the RLHF flavor of the model will probably perform better than an SFT model.

RLHF models are not very diverse. Base models will have much more entropy.

Base models are much better at generation/creativity as they haven’t yet been “corrected” by RLHF.

For example, base models are better at generating new Pokemon names than an RLHF model.

There are Elo ratings for assistant models published by Berkeley (the Chatbot Arena leaderboard).

Applications

Andrej details how a human might write the sentence: “California’s population is 53 times that of Alaska”

Humans go through many steps and a lot of internal computation to write this sentence:

  • Looking up the populations of the two states
  • Realizing the division is too hard to do mentally
  • Using a calculator to get the ratio
  • Sanity checking the result - did I do it right?
  • Starting to write the sentence
  • Editing the sentence

From GPT’s perspective this is just a sequence of tokens, and each token gets the same amount of work.

  • All of the internal dialog behind a sentence is stripped away; GPT spends the same amount of effort generating each token
  • The LLM doesn’t know what it doesn’t know, it just imitates the next token
  • The LLM doesn’t know what it’s not good at, it just imitates the next token
  • The LLM doesn’t reflect or sanity check, and it doesn’t correct mistakes along the way
  • There is no separate inner monolog stream
  • They do have very large fact-based knowledge across a vast number of areas (model compression)
  • They do have a very large and “perfect” working memory - their context window

They do have a few cognitive advantages:

  • They know a lot of facts - 10 billion!
  • They have a very large working memory
  • They can remember anything in their context window

Prompting is largely making up for the cognitive differences between these two systems.

The transformer cannot do too much reasoning per token. The transformer needs tokens to think!

For example asking the transformer to “show its work” will sometimes provide a better result.

“Let’s think step by step” spreads the reasoning over more tokens, so the transformer does less computational work per token (toy prompts shown after the references below).

  • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Wei et al. 2022
  • Large Language Models are Zero-Shot Reasoners, Kojima et al. 2022
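
A toy illustration (mine, not from the talk) of the zero-shot chain-of-thought trick from Kojima et al.:

    # Direct prompt: the model has to produce the answer in essentially one step.
    direct_prompt = "Q: A jar has 3 red and 5 blue marbles. What fraction are red? A:"

    # Chain-of-thought style prompt: appending "Let's think step by step" invites
    # the model to spend many tokens reasoning before it commits to an answer.
    cot_prompt = (
        "Q: A jar has 3 red and 5 blue marbles. What fraction are red?\n"
        "A: Let's think step by step."
    )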

Self-Consistency

  • The transformer can sample unlucky tokens and go down a bad reasoning path
  • Giving the transformer multiple attempts - sampling several reasoning paths and keeping the most consistent answer - does work (sketched after the reference below)
  • Self-Consistency Improves Chain of Thought Reasoning in Language Models Wang et al. 2023
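
A minimal sketch of the self-consistency idea, assuming a hypothetical sample_fn(prompt) that returns the final answer from one sampled chain-of-thought completion (temperature > 0 so the samples differ):

    from collections import Counter

    def self_consistent_answer(sample_fn, prompt, k=5):
        """Sample k completions and majority-vote their final answers."""
        answers = [sample_fn(prompt) for _ in range(k)]
        best_answer, _count = Counter(answers).most_common(1)[0]
        return best_answer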

Ask for reflection. It appears that LLMs often know when they have made a mistake. You can ask the model whether it actually completed the assignment; if not, it can be re-prompted to try again.

  • evjang.com/2023/03/26/self-reflection.html
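
A simple sketch of this reflect-and-retry loop, assuming a hypothetical llm(prompt) function that returns one completion:

    def answer_with_reflection(llm, question, max_retries=2):
        """Draft an answer, ask the model to check it, and retry if it says no."""
        answer = llm(f"Question: {question}\nAnswer:")
        for _ in range(max_retries):
            verdict = llm(
                f"Question: {question}\nProposed answer: {answer}\n"
                "Did the proposed answer complete the task correctly? Reply YES or NO."
            )
            if verdict.strip().upper().startswith("YES"):
                break
            answer = llm(
                f"Question: {question}\nThe previous attempt was wrong. Try again.\nAnswer:"
            )
        return answer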

We’re trying to recreate the human System 1 (fast, automatic) / System 2 (slow, deliberate) split for LLMs. It’s a bit like AlphaGo’s planning over candidate moves, but for text.

  • Mastering the game of go without human knowledge, Silver et al. 2017 (AlphaGo)
  • Tree of Thoughts: Deliberate Problem Solving with Large Language Models, Yao et al. 2023

Chains / Agents

  • ReAct: Synergizing Reasoning and Acting in Language Models, Yao et al. 2022
  • AutoGPT: Allows LLM to keep a task list

LLMs don’t “want” to succeed; they want to imitate their training set.

  • Large Language Models are Human-Level Prompt Engineers 2023

The training set contains solutions of wildly varying quality, and the model is trained purely on language modeling over all of them. Transformers can tell the difference between a good solution and a bad solution in their training set, but they don’t know which one you want. You need to ask for good performance and ask for the right answer, e.g. “You are an expert on this topic; think step by step to arrive at a strong answer.”

Transformers don’t know what they are not good at; they just predict the next token. For example, we should prompt the LLM to use a calculator for arithmetic - something it is not good at.
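
One way to wire this up (a sketch under my own assumptions, not the talk's tooling) is to tell the model to emit calculator calls and have the surrounding harness evaluate them safely:

    import ast
    import operator

    # Hypothetical convention: the system prompt tells the model it is bad at
    # mental arithmetic and should emit CALC(expression) whenever it needs a
    # number; the harness evaluates the expression and feeds the result back.
    SYSTEM_HINT = (
        "You are not good at mental arithmetic. Whenever you need to compute "
        "something, write CALC(expression) and wait for the result."
    )

    OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}

    def safe_eval(expr):
        """Evaluate a simple arithmetic expression without calling eval()."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval"))

    print(safe_eval("39000000 / 730000"))  # ~53, the California/Alaska ratio above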

LLMs work only from the memory baked into their weights by default; retrieval-based approaches, which also pull in relevant documents, work even better.

Retrieval-augmented models:

  • LlamaIndex

Transformers have a very large and extensive memory, but it’s not perfect; it helps to let the model look things up, like a person using a library (sketched below).
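
A bare-bones sketch of retrieval augmentation, assuming a hypothetical embed(text) function that returns an embedding vector (for example from an embedding model); libraries like LlamaIndex package this whole flow up:

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def retrieve(query, chunks, embed, k=3):
        """Return the k document chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
        return ranked[:k]

    def augmented_prompt(query, chunks, embed):
        """Stuff the retrieved chunks into the context ahead of the question."""
        context = "\n\n".join(retrieve(query, chunks, embed))
        return f"Use the following documents to answer.\n\n{context}\n\nQuestion: {query}"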

Constrained prompting - enforce that the output follows a template, e.g. that it is valid JSON.

  • github.com/microsoft/guidance
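
Tools like microsoft/guidance constrain generation token by token; a much cruder validate-and-retry loop (my sketch, assuming a hypothetical llm(prompt) function) captures the same goal of guaranteed-parseable output:

    import json

    def get_json(llm, prompt, required_keys=("name", "age"), max_retries=3):
        """Ask for JSON output and re-prompt until it parses with the right keys."""
        instruction = (
            f"{prompt}\nRespond with only a JSON object "
            f"containing the keys: {list(required_keys)}."
        )
        for _ in range(max_retries):
            text = llm(instruction)
            try:
                obj = json.loads(text)
                if all(k in obj for k in required_keys):
                    return obj
            except json.JSONDecodeError:
                pass
            instruction = (
                f"{prompt}\nYour previous reply was not valid JSON. Respond with "
                f"only a JSON object with the keys {list(required_keys)}."
            )
        raise ValueError("model never produced valid JSON")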

Fine Tuning

  • You can get far with prompt engineering, but finetuning is sometimes the right tool.
  • Parameter-efficient approaches change only small pieces of the model weights and clamp (freeze) most of the model (sketched after this list).
  • It takes a lot of technical and human-data work, but it is accessible to practitioners in industry.
  • Language Models are Few-Shot Learners, Brown et al. 2020
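
A minimal sketch of the "clamp most of the model" idea in the style of LoRA-like parameter-efficient finetuning (my illustration; the talk doesn't show code):

    import torch
    import torch.nn as nn

    class LowRankAdapterLinear(nn.Module):
        """A frozen pretrained linear layer plus a small trainable low-rank update.

        Only A and B (rank r) receive gradients; the original weights stay clamped.
        """
        def __init__(self, base: nn.Linear, r: int = 8):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # clamp the pretrained weights
            self.A = nn.Parameter(torch.randn(base.in_features, r) * 0.01)
            self.B = nn.Parameter(torch.zeros(r, base.out_features))

        def forward(self, x):
            return self.base(x) + (x @ self.A) @ self.B  # small low-rank correction

    layer = LowRankAdapterLinear(nn.Linear(512, 512))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 8192 trainable values vs. 262,656 in the frozen base layer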

RLHF is still research territory.

Limitations:

  • Models may be biased
  • Models may fabricate (“hallucinate”) information
  • Models may have reasoning errors
  • Models may struggle with whole classes of applications, e.g. spelling-related tasks
  • Models may have knowledge cutoffs - e.g. September 2021
  • Models are susceptible to prompt injection

Recommendations:

  • Use in low-stakes applications that have human oversight
  • Use as a source of inspiration and suggestions
  • Prefer copilots over autonomous agents