docs.json
[
{
"pageContent": "[MUSIC] ANNOUNCER:\nPlease welcome AI researcher and founding member of\nOpenAI, Andrej Karpathy. ANDREJ KARPATHY:\nHi, everyone. I'm happy to be here to tell you\nabout the state of GPT and more generally about the rapidly growing ecosystem\nof large language models. I would like to partition\nthe talk into two parts. In the first part, I would\nlike to tell you about how we train GPT Assistance, and then in the second part, we're going to take a\nlook at how we can use these assistants effectively\nfor your applications. First, let's take a\nlook at the emerging recipe for how to train these assistants and keep\nin mind that this is all very new and still\nrapidly evolving, but so far, the recipe\nlooks something like this. Now, this is a\ncomplicated slide, I'm going to go\nthrough it piece by piece, but roughly speaking, we have four major\nstages, pretraining, supervised finetuning,\nreward modeling, reinforcement learning, and they follow each\nother serially. Now, in each stage, we have a dataset that\npowers that stage. We have an algorithm that\nfor our purposes will be a objective and over for\ntraining the neural network, and then we have a\nresulting model, and then there are some\nnotes on the bottom. The first stage\nwe're going to start with as the pretraining stage. Now, this stage is\nspecial in this diagram, and this diagram is\nnot to scale because this stage is where all of the computational work\nbasically happens. This is 99 percent\nof the training compute time and also flops. This is where we\nare dealing with Internet scale datasets\nwith thousands of GPUs in the supercomputer and also months of\ntraining potentially. The other three\nstages are finetuning stages that are much more along the lines of small few number\nof GPUs and hours or days. Let's take a look at\nthe pretraining stage to achieve a base model. First, we are going to gather\na large amount of data. Here's an example\nof what we call a data mixture that comes from this paper that was released by Meta where they released\nthis LLaMA based model. Now, you can see roughly\nthe datasets that enter into these collections. We have CommonCrawl, which\nis a web scrape, C4, which is also CommonCrawl, and then some high\nquality datasets as well. For example, GitHub, Wikipedia, Books, Archives, Stock\nExchange and so on. These are all mixed up together, and then they are sampled according to some\ngiven proportions, and that forms the\ntraining set for the GPT. Now before we can actually\ntrain on this data, we need to go through one\nmore preprocessing step, and that is tokenization. This is basically\na translation of the raw text that we scrape\nfrom the Internet into sequences of integers because that's the native representation over which GPTs function. Now, this is a\nlossless translation between pieces of texts\nand tokens and integers, and there are a number of\nalgorithms for the stage. Typically, for\nexample, you could use something like\nbyte pair encoding, which iteratively\nmerges text chunks and groups them into tokens. Here, I'm showing some example\nchunks of these tokens, and then this is the\nraw integer sequence that will actually feed\ninto a transformer. Now, here I'm showing two examples for\nhybrid parameters that govern this stage. GPT-4, we did not release too much information about\nhow it was trained and so on, I'm using GPT-3s numbers, but GPT-3 is of course\na little bit old by now, about three years ago. But LLaMA is a fairly\nrecent model from Meta. 
These are roughly the orders of magnitude that we're dealing with when we're doing pretraining. The vocabulary size is usually a couple of ten thousand tokens. The context length is usually something like 2,000, 4,000, or nowadays even 100,000, and this governs the maximum number of integers that the GPT will look at when it's trying to predict the next integer in a sequence. You can see that the number of parameters is, say, 65 billion for LLaMA. Now, even though LLaMA has only 65B parameters compared to GPT-3's 175 billion parameters, LLaMA is a significantly more powerful model, and intuitively that's because the model is trained for significantly longer, in this case 1.4 trillion tokens instead of 300 billion tokens. You shouldn't judge the power of a model just by the number of parameters that it contains. Below, I'm showing some tables of rough hyperparameters that typically go into specifying the transformer neural network: the number of heads, the dimension size, the number of layers, and so on. On the bottom I'm showing some training hyperparameters. For example, to train the 65B model, Meta used 2,000 GPUs, roughly 21 days of training, and roughly several million dollars. Those are the rough orders of magnitude that you should have in mind for the pretraining stage.

Now, when we're actually pretraining, what happens? Roughly speaking, we are going to take our tokens and lay them out into data batches. We have these arrays that will feed into the transformer, and these arrays are B by T, where B is the batch size, with independent examples stacked up in rows, and T is the maximum context length. In my picture the context length is only 10, but in practice this could be 2,000, 4,000, and so on, so these are extremely long rows. What we do is take our documents, pack them into rows, and delimit them with special end-of-text tokens that basically tell the transformer where a new document begins. Here I have a few examples of documents, and then I've stretched them out into this input.

Now we're going to feed all of these numbers into the transformer. Let me just focus on a single particular cell, but the same thing will happen at every cell in this diagram. Let's look at the green cell. The green cell is going to take a look at all of the tokens before it, all of the tokens in yellow, and we're going to feed that entire context into the transformer neural network, and the transformer is going to try to predict the next token in the sequence, in this case the one in red. Now, the transformer, and unfortunately I don't have too much time to go into the full details of this neural network architecture, is for our purposes just a large blob of neural net stuff, and it typically has several tens of billions of parameters or something like that. Of course, as we tune these parameters, we get slightly different predicted distributions at every single one of these cells. For example, if our vocabulary size is 50,257 tokens, then we're going to have that many numbers, because we need to specify a probability distribution for what comes next; basically, we have a probability for whatever may follow. Now, in this specific example, for this specific cell, 513 will come next, and so we can use this as a source of supervision to update our transformer's weights.
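A minimal sketch of this packing-and-prediction setup, in PyTorch with made-up sizes and a stand-in for the actual multi-billion-parameter transformer:

```python
import torch
import torch.nn.functional as F

B, T, vocab_size = 4, 16, 50257  # batch size, context length, vocabulary (toy numbers)

# Suppose `stream` is one long array of token ids, with documents packed back
# to back and separated by an end-of-text token.
stream = torch.randint(0, vocab_size, (10_000,))

# Sample B random crops; inputs are T tokens, targets are the same tokens
# shifted one position to the left (the "next token" at every cell).
ix = torch.randint(0, len(stream) - T - 1, (B,)).tolist()
x = torch.stack([stream[i : i + T] for i in ix])          # (B, T) inputs
y = torch.stack([stream[i + 1 : i + T + 1] for i in ix])  # (B, T) targets

# A real model would be a large transformer producing these logits; random
# numbers stand in here just to show the shape of the training signal.
logits = torch.randn(B, T, vocab_size, requires_grad=True)
loss = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
loss.backward()  # gradients flow; an optimizer step would update the weights
```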
We apply this at basically every single cell in parallel, and we keep swapping batches, and we're trying to get the transformer to make the correct predictions over what token comes next in a sequence.

Let me show you more concretely what this looks like when you train one of these models. This example actually comes from the New York Times, where they trained a small GPT on Shakespeare. Here's a small snippet of Shakespeare, and they trained their GPT on it. In the beginning, at initialization, the GPT starts with completely random weights, so you're getting completely random outputs as well. But over time, as you train the GPT longer and longer, you get more and more coherent and consistent samples from the model. The way you sample from it, of course, is you predict what comes next, you sample from that distribution, and you keep feeding that back into the process, so you can basically sample large sequences. By the end, you see that the transformer has learned about words and where to put spaces and where to put commas and so on. We're making more and more consistent predictions over time. These are the kinds of plots that you are looking at when you're doing model pretraining. Effectively, we're looking at the loss function over time as you train, and low loss means that our transformer is giving a higher probability to the correct next integer in the sequence.

What are we going to do with this model once we've trained it after a month? Well, the first thing that we, the field, noticed is that these models, basically in the process of language modeling, learn very powerful general representations, and it's possible to very efficiently fine-tune them for any arbitrary downstream task you might be interested in. As an example, if you're interested in sentiment classification, the approach used to be that you collect a bunch of positives and negatives and then train some NLP model for that, but the new approach is: ignore sentiment classification, go off and do large language model pretraining, train a large transformer, and then you may only have a few examples and you can very efficiently fine-tune your model for that task. This works very well in practice. The reason for this is that the transformer is forced to multitask across a huge number of tasks just in the course of language modeling, because in order to predict the next token, it's forced to understand a lot about the structure of the text and all the different concepts therein. That was GPT-1.

Now, around the time of GPT-2, people noticed that actually, even better than fine-tuning, you can prompt these models very effectively. These are language models and they want to complete documents, so you can actually trick them into performing tasks by arranging fake documents. In this example, we have some passage and then we do Q&A, Q&A, Q&A; this is called a few-shot prompt. Then we do one more Q, and as the transformer tries to complete the document, it's actually answering our question. This is an example of prompt engineering a base model: making it believe that it's completing a document and thereby getting it to perform a task. This kicked off, I think, the era of prompting over fine-tuning, and we saw that this can work extremely well on a lot of problems, even without training any neural networks or fine-tuning at all. Since then, we've seen an entire evolutionary tree of base models that everyone has trained.
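Going back to the few-shot prompting trick for a moment, here is roughly what such a prompt looks like as a plain string; the passage and questions are made up for illustration, not taken from the slide.

```python
# A hypothetical few-shot prompt: the base model is just completing a
# "document", but the document's pattern tricks it into doing Q&A.
passage = "Large language models are trained to predict the next token in large text corpora."

few_shot_prompt = f"""{passage}

Q: What objective are large language models trained on?
A: Next-token prediction.

Q: What kind of data are they trained on?
A: Large collections of text scraped from the Internet.

Q: Why does prompting work without any fine-tuning?
A:"""

# Feeding `few_shot_prompt` to a base model, its most likely continuation is
# an answer, because that is what would come next in such a document.
print(few_shot_prompt)
```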
Not all of these models are available. For example, the GPT-4 base model was never released. The GPT-4 model that you might be interacting with over the API is not a base model, it's an assistant model, and we're going to cover how to get those in a bit. The GPT-3 base model is available via the API under the name davinci, and the GPT-2 base model is even available as weights on our GitHub repo. But currently the best available base model is probably the LLaMA series from Meta, although it is not commercially licensed.

Now, one thing to point out is that base models are not assistants. They don't want to answer your questions, they want to complete documents. If you tell them to write a poem about bread and cheese, they may just answer your question with more questions, because they're completing what they think is a document. However, you can prompt base models in a specific way that is more likely to work. As an example, you could write: here's a poem about bread and cheese, and in that case the model will autocomplete correctly. You can even trick base models into being assistants. The way you would do this is you would create a specific few-shot prompt that makes it look like there's some document in which a human and an assistant are exchanging information, and then at the bottom you put your query at the end; the base model will condition itself into being a helpful assistant and answer. But this is not very reliable and doesn't work super well in practice, although it can be done. Instead, we have a different path to making actual GPT assistants rather than base model document completers, and that takes us to supervised finetuning.

In the supervised finetuning stage, we are going to collect small but high-quality datasets. In this case, we ask human contractors to gather data of the form: prompt, and ideal response. We're going to collect lots of these, typically tens of thousands or something like that. Then we're going to still do language modeling on this data; nothing changes algorithmically, we're just swapping out the training set. It used to be Internet documents, which is high quantity but low quality, and we swap that out for Q&A prompt-response data, which is low quantity but high quality. We still do language modeling, and after training we get an SFT model. You can actually deploy these models; they are actual assistants, and they work to some extent.

Let me show you what an example demonstration might look like. Here's something that a human contractor might come up with. Here's some random prompt: can you write a short introduction about the relevance of the term monopsony, or something like that? Then the contractor also writes out an ideal response. When they write out these responses, they are following extensive labeling documentation, and they are being asked to be helpful, truthful, and harmless. You probably can't read these labeling instructions here, and neither can I, but they're long, and this is people following instructions and trying to complete these prompts. That's what the dataset looks like. You can train these models, and this works to some extent.

Now, you can actually continue the pipeline from here and go into RLHF, reinforcement learning from human feedback, which consists of both reward modeling and reinforcement learning. Let me cover that, and then I'll come back to why you may want to go through the extra steps and how that compares to SFT models.
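To make the SFT data concrete, here is a minimal sketch of what one prompt-response example might look like; the exact internal format is not specified in the talk, so the layout below is an assumption.

```python
# A hypothetical SFT example: a prompt written or chosen by a contractor,
# plus the ideal response they wrote while following the labeling guidelines.
example = {
    "prompt": "Can you write a short introduction about the relevance of "
              "the term monopsony in economics?",
    "response": "Monopsony refers to a market structure in which there is "
                "a single buyer of a good or service...",
}

def to_training_text(ex: dict) -> str:
    # Concatenate prompt and ideal response into one document; training is
    # still plain language modeling, only the dataset has changed.
    return ex["prompt"] + "\n\n" + ex["response"]

print(to_training_text(example))
```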
In the reward modeling step, what we're going to do is shift our data collection to be of the form of comparisons. Here's an example of what our dataset will look like. I have the same identical prompt on the top, which asks the assistant to write a program or a function that checks whether a given string is a palindrome. Then what we do is take the SFT model, which we've already trained, and create multiple completions. In this case, we have three completions that the model has created, and then we ask people to rank these completions. If you stare at this for a while, and by the way, these are very difficult comparisons to make, this can take people even hours for a single prompt and its completions, but let's say we decided that one of these is much better than the others, and so on. We rank them.

Then we can follow that with something that looks very much like binary classification on all the possible pairs between these completions. What we do now is lay out our prompt in rows, and the prompt is identical across all three rows here. It's all the same prompt, but the completion varies; the yellow tokens are coming from the SFT model. Then we append another special reward readout token at the end, and we basically only supervise the transformer at this single green token. The transformer will predict some reward for how good that completion is for that prompt; basically, it makes a guess about the quality of each completion. Once it makes a guess for every one of them, we also have the ground truth, which tells us their ranking, so we can enforce that some of these numbers should be much higher than others, and so on. We formulate this into a loss function and train our model to make reward predictions that are consistent with the ground truth coming from the comparisons from all these contractors. That's how we train our reward model, and that allows us to score how good a completion is for a prompt.

Once we have a reward model, we can't deploy it by itself, because it's not very useful as an assistant on its own, but it's very useful for the reinforcement learning stage that follows. Because we have a reward model, we can score the quality of any arbitrary completion for any given prompt. What we do during reinforcement learning is we again get a large collection of prompts, and now we do reinforcement learning with respect to the reward model. Here's what that looks like. We take a single prompt, lay it out in rows, and now we use the model we'd like to train, which was initialized at the SFT model, to create some completions in yellow, and then we append the reward token again and read off the reward according to the reward model, which is now kept fixed; it doesn't change anymore. The reward model tells us the quality of every single completion for all these prompts, and so what we can do is apply basically the same language modeling loss function, except we're now training on the yellow tokens, and we are weighting the language modeling objective by the rewards indicated by the reward model. As an example, in the first row, the reward model said that this is a fairly high-scoring completion, so all the tokens that we happened to sample in the first row are going to get reinforced, and they're going to get higher probabilities in the future.
Conversely, on the second row, the reward model really did not like this completion; it scored it -1.2. Therefore, every single token that we sampled in that second row is going to get a slightly lower probability in the future. We do this over and over on many prompts and many batches, and basically we get a policy that creates yellow tokens here such that all the completions score high according to the reward model we trained in the previous stage. That's what the RLHF pipeline is. At the end, you get a model that you can deploy. As an example, ChatGPT is an RLHF model, but some other models that you might come across, for example Vicuna-13B and so on, are SFT models. So we have base models, SFT models, and RLHF models, and that's the state of things there.

Now, why would you want to do RLHF? One answer that's not that exciting is that it just works better. This comes from the InstructGPT paper. In those experiments, which are a while ago now, the PPO models are the RLHF models, and we see that they are basically preferred in a lot of comparisons when we give their outputs to humans. Humans prefer tokens that come from RLHF models compared to SFT models, compared to a base model that is prompted to be an assistant. It just works better.

But you might ask, why does it work better? I don't think there's a single amazing answer that the community has really agreed on, but I will offer one reason, potentially: it has to do with the asymmetry between how computationally easy it is to compare versus to generate. Let's take the example of generating a haiku. Suppose I ask a model to write a haiku about paper clips. Imagine being a contractor collecting data for the SFT stage: how are you supposed to create a nice haiku about a paper clip? You might not be very good at that. But if I give you a few example haikus, you might be able to appreciate some of them a lot more than others. Judging which one is good is a much easier task. This asymmetry makes it so that comparisons are potentially a better way to leverage yourself as a human and your judgment to create a slightly better model.

Now, RLHF models are not strictly an improvement on the base models in all cases. In particular, we noticed, for example, that they lose some entropy. That means they give more peaky results; they can output samples with lower variation than the base model. The base model has lots of entropy and will give lots of diverse outputs. For example, one place where I still prefer to use a base model is the setup where you basically have N things and you want to generate more things like them. Here is an example that I just cooked up: I want to generate cool Pokemon names. I gave it seven Pokemon names and asked the base model to complete the document, and it gave me a lot more Pokemon names. These are fictitious; I tried to look them up, and I don't believe they're actual Pokemon. This is the kind of task that I think the base model would be good at, because it still has lots of entropy and will give you lots of diverse, cool things that look like whatever you gave it before.

Having said all that, these are the assistant models that are probably available to you at this point. There was a team at Berkeley that ranked a lot of the available assistant models and gave them basically Elo ratings.
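As a rough recap in code of the two RLHF training signals described above, the reward model's pairwise comparison loss and the reward-weighted update on sampled tokens: this is only the core intuition, written in PyTorch with made-up numbers. The pairwise form follows the InstructGPT paper; production RLHF uses PPO with a KL penalty to the SFT model, which is considerably more involved.

```python
import torch
import torch.nn.functional as F

# 1) Reward modeling: for a pair of completions of the same prompt where
#    humans preferred the first, push its scalar reward above the other's.
r_preferred = torch.tensor(1.3, requires_grad=True)   # reward read off at the special token
r_rejected = torch.tensor(-0.4, requires_grad=True)
rm_loss = -F.logsigmoid(r_preferred - r_rejected)

# 2) Reinforcement learning (simplified): weight the language modeling
#    objective on the sampled (yellow) tokens by the frozen reward model's
#    score, so high-reward completions are reinforced, low-reward suppressed.
logp = torch.log_softmax(torch.randn(3, 8, 50257, requires_grad=True), dim=-1)
sampled = torch.randint(0, 50257, (3, 8))              # 3 completions, 8 tokens each
token_logp = logp.gather(-1, sampled[..., None]).squeeze(-1)
rewards = torch.tensor([1.0, -1.2, 0.3])               # one score per completion
rl_loss = -(rewards[:, None] * token_logp).mean()

(rm_loss + rl_loss).backward()
```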
Currently, some of the best models are, of course, GPT-4, by far I would say, followed by Claude, GPT-3.5, and then a number of models, some of which might be available as weights, like Vicuna, Koala, etc. The first three rows here are all RLHF models, and all of the other models, to my knowledge, are SFT models, I believe.

That's how we train these models at a high level. Now I'm going to switch gears, and let's look at how we can best apply the GPT assistant model to your problems. I would like to work in the setting of a concrete example. Let's say that you are working on an article or a blog post, and you're going to write this sentence at the end: California's population is 53 times that of Alaska. So for some reason, you want to compare the populations of these two states. Think about the rich internal monologue and tool use, and how much work actually goes on computationally in your brain to generate this one final sentence.

Here's maybe what that could look like in your brain. For this next step of my blog, let me compare these two populations. First, I'm obviously going to need both of these populations. Now, I know that I probably don't know them off the top of my head, so I'm aware of what I do and don't know; that's self-knowledge. So I do some tool use: I go to Wikipedia and look up California's population and Alaska's population. Next, I know that I should divide the two, but again, I know that dividing 39.2 by 0.74 is very unlikely to succeed in my head; that's not something I can do mentally, so I'm going to rely on a calculator. I punch it in and see that the output is roughly 53. Then maybe I do some reflection and sanity checks in my brain: does 53 make sense? Well, that's quite a large fraction, but then California is the most populous state, so maybe that looks okay. Then I have all the information I might need, and now I get to the creative portion of writing. I might start to write something like California has 53x times greater, and then I think to myself, that's actually really awkward phrasing, so let me delete that and try again. As I'm writing, I have this separate process, almost inspecting what I'm writing and judging whether it looks good or not, and maybe I delete it and maybe I reframe it, and then maybe I'm happy with what comes out. Basically, long story short, a ton happens under the hood in terms of your internal monologue when you create sentences like this.

But what does a sentence like this look like when we are training a GPT on it? From GPT's perspective, this is just a sequence of tokens. GPT, when it's reading or generating these tokens, just goes chunk, chunk, chunk, chunk, and each chunk is roughly the same amount of computational work. These transformers are not very shallow networks; they have about 80 layers of reasoning, but 80 is still not too much. The transformer is going to do its best to imitate, but of course the process here looks very different from the process that you took. In particular, in the final artifacts, in the datasets that we create and then eventually feed to LLMs, all that internal dialogue is completely stripped, and unlike you, the GPT will look at every single token and spend the same amount of compute on every one of them.
So you can't expect it to do too much work per token, and also, in particular, these transformers are just token simulators; they don't know what they don't know. They just imitate the next token. They don't know what they're good at or not good at; they just try their best to imitate the next token. They don't reflect in a loop. They don't sanity check anything. They don't correct their mistakes along the way. By default, they just sample token sequences. They don't have a separate inner monologue stream in their head that is evaluating what's happening.

Now, they do have some cognitive advantages, I would say, and that is that they actually have very large fact-based knowledge across a vast number of areas, because they have, say, several tens of billions of parameters. That's a lot of storage for a lot of facts. They also, I think, have a relatively large and perfect working memory. Whatever fits into the context window is immediately available to the transformer through its internal self-attention mechanism. It's a perfect memory, but it's got a finite size; the transformer has very direct access to it, so it can losslessly remember anything that is inside its context window. This is how I would compare those two, and the reason I bring all of this up is because I think, to a large extent, prompting is just making up for this cognitive difference between these two architectures, our brains and LLM brains. You can almost look at it that way.

Here's one thing that people found, for example, works pretty well in practice. Especially if your tasks require reasoning, you can't expect the transformer to do too much reasoning per token. You have to really spread out the reasoning across more and more tokens. For example, you can't give a transformer a very complicated question and expect it to get the answer in a single token; there's just not enough time for it. These transformers need tokens to think, I like to say sometimes. Here are some of the things that work well. You may, for example, have a few-shot prompt that shows the transformer that it should show its work when it's answering a question, and if you give a few examples, the transformer will imitate that template, and it will just end up working out better in terms of its evaluation. Additionally, you can elicit this behavior from the transformer by saying, let's think step by step, because this conditions the transformer into showing its work, and because it snaps into a mode of showing its work, it's going to do less computational work per token. It's more likely to succeed as a result, because it's spreading its reasoning over more tokens.

Here's another example; this one is called self-consistency. We saw that we had the ability to start writing, and then if it didn't work out, we could try again, and we could try multiple times and maybe select the one that worked best. In these approaches, you may sample not just once, but multiple times, and then have some process for finding the ones that are good, and then keep just those samples or do a majority vote or something like that. Basically, these transformers, in the process of predicting the next token, just like you, can get unlucky: they could sample a not very good token, and they can go down a blind alley in terms of reasoning. Unlike you, they cannot recover from that.
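Sampling multiple times and keeping the majority answer, as in the self-consistency idea above, is one way around this. A minimal sketch, where sample_completion and extract_answer are hypothetical helpers around an LLM API, not functions from the talk:

```python
from collections import Counter

def self_consistency(prompt, sample_completion, extract_answer, n=10):
    # Draw n independent chain-of-thought samples (temperature > 0), pull the
    # final answer out of each, and keep the answer most reasoning paths agree on.
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]

# Example with stub helpers; a real sample_completion would call an LLM API.
answer = self_consistency(
    "What is 39.2 / 0.74? Think step by step.",
    sample_completion=lambda p: "... so the answer is 53",
    extract_answer=lambda c: c.split()[-1],
    n=5,
)
print(answer)  # "53"
```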
They are stuck with every single token they sample, so they will continue the sequence even if they know it is not going to work out, so you have to give them the ability to look back, inspect, or basically sample around it.

Here's one more technique: it turns out that LLMs actually know when they've screwed up. As an example, say you ask the model to generate a poem that does not rhyme, and it gives you a poem, but it actually rhymes. It turns out that, especially for the bigger models like GPT-4, you can just ask it, did you meet the assignment? And GPT-4 knows very well that it did not meet the assignment; it just got unlucky in its sampling. It will tell you, no, I didn't actually meet the assignment here, let me try again. But without you prompting it, it doesn't know to revisit its answer. You have to make up for that in your prompts; you have to get it to check. If you don't ask it to check, it's not going to check by itself; it's just a token simulator.

I think, more generally, a lot of these techniques fall into the bucket of what I would call recreating our System 2. You might be familiar with System 1 and System 2 thinking for humans. System 1 is a fast, automatic process, and I think it corresponds to an LLM just sampling tokens. System 2 is the slower, deliberate planning part of your brain. Here's a paper actually from just last week, because this space is evolving pretty quickly; it's called Tree of Thoughts. The authors of this paper propose maintaining multiple completions for any given prompt, scoring them along the way, and keeping the ones that are going well, if that makes sense. A lot of people are really playing around with prompt engineering to basically bring back for LLMs some of these abilities that we have in our brains.

Now, one thing I would like to note here is that this is not just a prompt. These are actually prompts used together with some Python glue code, because you have to maintain multiple prompts, and you also have to do some tree search algorithm to figure out which prompts to expand, etc. It's a symbiosis of Python glue code and individual prompts that are called in a while loop or in a bigger algorithm. I also think there's a really cool parallel here to AlphaGo. AlphaGo has a policy for placing the next stone when it plays Go, and that policy was trained originally by imitating humans. But in addition to this policy, it also does Monte Carlo tree search: it will play out a number of possibilities in its head, evaluate all of them, and only keep the ones that work well. I think this is an equivalent of AlphaGo, but for text, if that makes sense.

Just like Tree of Thoughts, I think more generally people are starting to explore more general techniques, not just simple question-answer prompts, but something that looks a lot more like Python glue code stringing together many prompts. On the right, I have an example from a paper called ReAct, where they structure the answer to a prompt as a sequence of thought, action, observation, thought, action, observation; it's a full rollout, a thinking process, to answer the query, and in these actions the model is also allowed to use tools. On the left, I have an example of AutoGPT. Now, AutoGPT, by the way, is a project that I think got a lot of hype recently, but I still find it inspirationally interesting.
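To make the idea of Python glue code around prompts concrete, here is a rough sketch of a best-first search over candidate completions, loosely in the spirit of Tree of Thoughts but not the paper's actual code; generate and score are hypothetical wrappers around an LLM API.

```python
def best_first_search(prompt, generate, score, width=3, depth=3):
    # Keep several candidate continuations alive, score each with another LLM
    # call (or a heuristic), and expand only the most promising ones.
    frontier = [prompt]
    for _ in range(depth):
        candidates = []
        for partial in frontier:
            for _ in range(width):
                candidates.append(partial + generate(partial))  # sample a continuation
        # Keep only the highest-scoring partial solutions for the next round.
        frontier = sorted(candidates, key=score, reverse=True)[:width]
    return frontier[0]
```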
AutoGPT is a project that allows an LLM to keep a task list and continue to recursively break down tasks. I don't think this currently works very well, and I would not advise people to use it in practical applications; I just think it's something to take inspiration from in terms of where this is going over time. That's like giving our model System 2 thinking.

The next thing I find interesting is the following, I would say almost psychological, quirk of LLMs: LLMs don't want to succeed, they want to imitate. You want them to succeed, and you should ask for it. What I mean by that is, when transformers are trained, they have training sets, and there can be an entire spectrum of performance qualities in their training data. For example, there could be some kind of prompt for a physics question or something like that, and there could be a student's solution that is completely wrong, but there can also be an expert answer that is extremely right. Transformers can't tell the difference between them, or rather, they do know about low-quality solutions and high-quality solutions, but by default they want to imitate all of it, because they're just trained on language modeling. At test time, you actually have to ask for good performance. In this example, in this paper, they tried various prompts. Let's think step by step was very powerful, because it spread out the reasoning over many tokens, but what worked even better is: let's work this out in a step-by-step way to be sure we have the right answer. It's like conditioning on getting the right answer, and this actually makes the transformer work better, because the transformer doesn't have to hedge its probability mass on low-quality solutions, as ridiculous as that sounds. Basically, feel free to ask for a strong solution. Say something like, you are a leading expert on this topic, pretend you have IQ 120, etc. But don't try to ask for too much IQ, because if you ask for IQ 400, you might be out of the data distribution, or even worse, you could be in the data distribution for something like sci-fi stuff, and it will start to take on some sci-fi, or roleplaying, or something like that. You have to find the right amount of IQ; I think it's got some U-shaped curve there.

Next up, as we saw, when we are trying to solve problems, we know what we are good at and what we're not good at, and we lean on tools computationally. You want to do the same potentially with your LLMs. In particular, we may want to give them calculators, code interpreters, and so on, the ability to do search, and there are a lot of techniques for doing that. One thing to keep in mind, again, is that these transformers by default may not know what they don't know. You may even want to tell the transformer in a prompt: you are not very good at mental arithmetic; whenever you need to do very large number addition, multiplication, or whatever, instead use this calculator, and here's how you use the calculator, you use this token combination, etc. You have to actually spell it out, because the model by default doesn't know what it's good at or not good at, necessarily, just like you and I might not.

Next up, I think something that is very interesting is that we went from a world that was retrieval only, and the pendulum has swung all the way to the other extreme, where it's memory only in LLMs. But actually, there's this entire space in between of retrieval-augmented models, and this works extremely well in practice.
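Here is a rough sketch of that calculator idea: the system instruction is the kind of wording described above, while the CALC[...] token combination and the glue code around it are made up for illustration.

```python
import re

# Tell the model up front that it is bad at mental arithmetic and should emit
# a special marker instead of guessing; glue code then intercepts the marker.
SYSTEM = (
    "You are not very good at mental arithmetic. Whenever you need to do "
    "large-number arithmetic, write CALC[expression] instead of guessing."
)

def run_with_calculator(model_output: str) -> str:
    # Replace every CALC[...] marker with the evaluated result.
    def evaluate(match):
        # Toy evaluator; never eval untrusted model output in a real system.
        return str(eval(match.group(1), {"__builtins__": {}}))
    return re.sub(r"CALC\[(.+?)\]", evaluate, model_output)

print(run_with_calculator("39.2 / 0.74 is CALC[39.2 / 0.74]."))  # roughly 53
```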
As I mentioned, the context window of a transformer is its working memory. If you can load the working memory with any information that is relevant to the task, the model will work extremely well, because it can immediately access all that memory. I think a lot of people are really interested in retrieval-augmented generation. On the bottom, I have an example of LlamaIndex, which is one data connector to lots of different types of data. You can index all of that data and make it accessible to LLMs. The emerging recipe there is: you take relevant documents, you split them up into chunks, you embed all of them, and you basically get embedding vectors that represent that data. You store that in a vector store, and then at test time you make some kind of a query to your vector store, you fetch chunks that might be relevant to your task, you stuff them into the prompt, and then you generate. This can work quite well in practice.

This is, I think, similar to how you and I solve problems. You can do everything from memory, and transformers have a very large and extensive memory, but it also really helps to reference some primary documents. Whenever you find yourself going back to a textbook to find something, or going back to the documentation of a library to look something up, transformers definitely want to do that too. You have some memory of how some documentation of a library works, but it's much better to look it up. The same applies here.

Next, I wanted to briefly talk about constraint prompting. I also find this very interesting. This is basically a set of techniques for forcing a certain template in the outputs of LLMs. Guidance is one example, from Microsoft actually. Here we are enforcing that the output from the LLM will be JSON. This will actually guarantee that the output takes on this form, because they go in and mess with the probabilities of all the different tokens that come out of the transformer, and they clamp those tokens; the transformer is then only filling in the blanks, and you can enforce additional restrictions on what could go into those blanks. This can be really helpful, and I think this kind of constrained sampling is also extremely interesting.

I also want to say a few words about fine-tuning. It is the case that you can get really far with prompt engineering, but it's also possible to think about fine-tuning your models. Now, fine-tuning models means that you are actually going to change the weights of the model. It is becoming a lot more accessible to do this in practice, and that's because of a number of techniques and libraries that have been developed very recently. For example, parameter-efficient fine-tuning techniques like LoRA make sure that you're only training small, sparse pieces of your model. Most of the model is kept clamped at the base model, and some pieces of it are allowed to change. This still works pretty well empirically and makes it much cheaper to tune only small pieces of your model. It also means that, because most of your model is clamped, you can use very low precision inference for computing those parts, because they are not going to be updated by gradient descent, and that makes everything a lot more efficient as well. In addition, we have a number of open-source, high-quality base models.
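As a bare-bones illustration of the LoRA idea, here is a sketch of a linear layer with a frozen base weight plus a small trainable low-rank update; real implementations (for example the peft library) handle scaling, dropout, and weight merging, so this is only the gist.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # clamp the base model weights
        # Only these two small matrices are trained.
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.A @ self.B    # base output + low-rank delta

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # far fewer parameters than the frozen 4096 x 4096 weight
```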
Currently, as I mentioned, I think LLaMA is quite nice, although it is not commercially licensed, I believe, right now. Something to keep in mind is that fine-tuning is a lot more technically involved. It requires a lot more, I think, technical expertise to do right. It requires human data contractors for datasets and/or synthetic data pipelines that can be pretty complicated. It will definitely slow down your iteration cycle by a lot. I would say, on a high level, SFT is achievable, because you're just continuing the language modeling task; it's relatively straightforward. But RLHF, I would say, is very much research territory and is even much harder to get to work, so I would probably not advise someone to just try to roll their own RLHF implementation. These things are pretty unstable, very difficult to train, not something that is, I think, very beginner friendly right now, and it's also likely to keep changing pretty rapidly.

So I think these are my default recommendations right now. I would break up your task into two major parts: number one, achieve your top performance, and number two, optimize your performance, in that order. Number one: the best performance will currently come from the GPT-4 model. It is the most capable of all by far. Use prompts that are very detailed; they should have lots of task context, relevant information, and instructions. Think along the lines of, what would you tell a task contractor if they can't email you back? But also keep in mind that a task contractor is a human, and they have inner monologue and they're very clever, etc. LLMs do not possess those qualities, so make sure to think through the psychology of the LLM, almost, and cater your prompts to that. Retrieve and add any relevant context and information to these prompts. Basically, refer to a lot of the prompt engineering techniques; I've highlighted some of them in the slides above, but this is a very large space, and I would just advise you to look for prompt engineering techniques online. There's a lot to cover there.

Experiment with few-shot examples. What this refers to is: you don't just want to tell, you want to show, whenever it's possible, so give it examples of everything that help it really understand what you mean, if you can. Experiment with tools and plug-ins to offload tasks that are difficult for LLMs natively, and then think about not just a single prompt and answer; think about potential chains and reflection, how you glue them together, and how you could potentially make multiple samples and so on. Finally, if you think you've squeezed out prompt engineering, which I think you should stick with for a while, look at potentially fine-tuning a model to your application, but expect this to be a lot slower and more involved. Then there's an expert, fragile research zone here, and I would say that is RLHF, which currently does work a bit better than SFT if you can get it to work, but again, this is pretty involved, I would say. And to optimize your costs, try to explore lower-capacity models or shorter prompts and so on.

I also wanted to say a few words about the use cases for which I think LLMs are currently well suited. In particular, note that there are a large number of limitations to LLMs today, so I would definitely keep that in mind for all of your applications. This, by the way, could be an entire talk, so I don't have time to cover it in full detail.
Models may be biased, they may fabricate or hallucinate information, they may have reasoning errors, they may struggle in entire classes of applications, and they have knowledge cut-offs, so they might not know any information after, say, September 2021. They are susceptible to a large range of attacks, which are coming out on Twitter daily, including prompt injection, jailbreak attacks, data poisoning attacks, and so on. So my recommendation right now is: use LLMs in low-stakes applications, combine them always with human oversight, use them as a source of inspiration and suggestions, and think copilots instead of completely autonomous agents that are just performing a task somewhere. It's just not clear that the models are there right now.

I wanted to close by saying that GPT-4 is an amazing artifact. I'm very thankful that it exists, and it's beautiful. It has a ton of knowledge across so many areas. It can do math, code, and so on. In addition, there's this thriving ecosystem of everything else that is being built and incorporated into the ecosystem; some of these things I've talked about, and all of this power is accessible at your fingertips. Here's everything that's needed in terms of code to ask GPT-4 a question, to prompt it, and get a response. In this case, I said: can you say something to inspire the audience of Microsoft Build 2023? I just punched this into Python, and verbatim, GPT-4 said the following. By the way, I did not know that they used this trick in the keynote, so I thought I was being clever, but it is really good at this. It says: Ladies and gentlemen, innovators and trailblazers of Microsoft Build 2023, welcome to the gathering of brilliant minds like no other. You are the architects of the future, the visionaries molding the digital realm in which humanity thrives. Embrace the limitless possibilities of technologies and let your ideas soar as high as your imagination. Together, let's create a more connected, remarkable, and inclusive world for generations to come. Get ready to unleash your creativity, canvas the unknown, and turn dreams into reality. Your journey begins today!",
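For reference, a call of the kind described at the end of the transcript might look roughly like this with the openai Python package as it existed around the time of the talk (pre-1.0 API); this is a sketch, not a reproduction of the code shown on the slide.

```python
import openai  # openai Python package, pre-1.0 style API

openai.api_key = "YOUR_API_KEY"  # placeholder

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Can you say something to inspire the audience "
                          "of Microsoft Build 2023?"}],
)
print(response["choices"][0]["message"]["content"])
```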
"metadata": {
"source": "bZQun8Y4L2A"
}
}
]