Large language model

A large language model (LLM) is a computerized language model consisting of an artificial neural network with many parameters (tens of millions to billions), trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning.[1] LLMs emerged around 2018 and perform well at a wide variety of tasks. This has shifted the focus of natural language processing research away from the previous paradigm of training specialized supervised models for specific tasks.[2]

Though the term large language model has no formal definition, it often refers to deep learning models with millions or even billions of parameters that have been "pre-trained" on a large corpus. LLMs are general-purpose models which excel at a wide range of tasks, as opposed to being trained for one specific task (such as sentiment analysis, named entity recognition, or mathematical reasoning).[2][3] The skill with which they accomplish tasks, and the range of tasks of which they are capable, seems to be a function of the amount of resources (data, parameter count, computing power) devoted to them, in a way that is not dependent on additional breakthroughs in design.[4]

Though trained on simple tasks along the lines of predicting the next word in a sentence, neural language models with sufficient training and parameter counts are found to capture much of the syntax and semantics of human language. In addition, large language models demonstrate considerable general knowledge about the world, and are able to "memorize" a great quantity of facts during training.[2]

Properties

Pretraining datasets

LLMs are pre-trained on large textual datasets. Some commonly used textual datasets are Common Crawl, The Pile, MassiveText,[5] Wikipedia, and GitHub. The datasets run up to 10 trillion words in size.

The stock of high-quality language data is estimated at 4.6 to 17 trillion words, within an order of magnitude of the size of the largest textual datasets.[6]

Scaling laws

In general, an LLM can be characterized by four parameters: size of the model, size of the training dataset, cost of training, and performance after training. Each of these four variables can be precisely defined as a real number, and they are empirically found to be related by simple statistical laws, called "scaling laws".

One particular scaling law ("Chinchilla scaling") for LLMs trained autoregressively for one epoch, with a cosine learning rate schedule, states that:[7]

\begin{cases} C = C_0 N D \\ L = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + L_0 \end{cases}

where the variables are

  • C is the cost of training the model, in FLOPs.
  • N is the number of parameters in the model.
  • D is the number of tokens in the training set.
  • L is the average negative log-likelihood loss per token (nats/token), achieved by the trained LLM on the test dataset.

and the statistical parameters are

  •  C_0 = 6, meaning that it costs 6 FLOPs per parameter to train on one token.[8] Note that training cost is much higher than inference cost, where it costs 1 to 2 FLOPs per parameter to infer on one token.
  • \alpha = 0.34, \beta = 0.28, A = 406.4, B = 410.7, L_0 = 1.69
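
For illustration, the law above can be turned into a small calculator. The closed-form compute-optimal split below is derived here by minimizing L subject to C = C_0 N D (a standard Lagrange-multiplier exercise, not quoted from the source), and the FLOP budget is an arbitrary example value:

```python
A, B, L0 = 406.4, 410.7, 1.69            # fitted constants from the text
alpha, beta, C0 = 0.34, 0.28, 6.0

def loss(N, D):
    """Predicted test loss (nats/token) for N parameters and D training tokens."""
    return A / N**alpha + B / D**beta + L0

def optimal_allocation(C):
    """Compute-optimal N and D for a FLOP budget C = C0 * N * D."""
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N = G * (C / C0) ** (beta / (alpha + beta))
    D = (C / C0) / N
    return N, D

C = 1e23                                  # example FLOP budget (assumption)
N, D = optimal_allocation(C)
print(f"N = {N:.3g} parameters, D = {D:.3g} tokens, L = {loss(N, D):.3f} nats/token")
```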

Emergent abilities

File:LLM emergent benchmarks.png
On a number of natural language benchmarks involving tasks such as question answering, models perform no better than random chance until they reach a certain scale (in this case, measured by training computation), at which point their performance sharply increases. These are examples of emergent abilities.

While it is generally the case that performance of large models on various tasks can be extrapolated based on the performance of similar smaller models, sometimes "breaks"[9] in downstream scaling laws occur such that larger models suddenly acquire substantial abilities at a different rate than in smaller models. These are often referred to as "emergent abilities", and have been the subject of substantial study. Researchers note that such abilities often "cannot be predicted simply by extrapolating the performance of smaller models".[3] These abilities are discovered rather than programmed-in or designed, in some cases only after the LLM has been publicly deployed.[4] Hundreds of emergent abilities have been described. Examples include multi-step arithmetic, taking college-level exams, identifying the intended meaning of a word,[3] chain-of-thought prompting,[3] decoding the International Phonetic Alphabet, unscrambling a word’s letters, identifying offensive content in paragraphs of Hinglish (a combination of Hindi and English), and generating a similar English equivalent of Kiswahili proverbs.[10]

Criticism

Schaeffer et al. argue that emergent abilities are not unpredictably acquired, but predictably acquired according to a smooth scaling law. The authors considered a toy statistical model of an LLM solving multiple-choice questions, and showed that this statistical model, modified to account for other types of tasks, applies to these tasks as well.[11]

Let x be the parameter count, and y be the performance of the model.

  • When y = average \Pr(\text{correct token}), the (\log x, y) curve is an exponential curve (before it plateaus at one), which looks like emergence.
  • When y = average \log \Pr(\text{correct token}), the (\log x, y) curve is a straight line (before it plateaus at zero), which does not look like emergence.
  • When y = average \Pr(\text{the most likely token is correct}), the (\log x, y) curve is a step function, which looks like emergence.
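
The effect is easy to reproduce numerically. In the sketch below (my construction in the spirit of the argument, not Schaeffer et al.'s exact setup), per-token accuracy improves smoothly with scale, yet an exact-match metric over a 10-token answer still looks like a sudden jump:

```python
import numpy as np

N = np.logspace(6, 12, 7)              # parameter counts from 1e6 to 1e12
per_token_loss = 10.0 * N ** -0.15     # assumed smooth power-law decay of loss
p_token = np.exp(-per_token_loss)      # average Pr(correct token): smooth in log N

k = 10                                 # tokens per answer (assumption)
exact_match = p_token ** k             # all k tokens must be right: looks "emergent"

for n, p, em in zip(N, p_token, exact_match):
    print(f"N={n:.0e}  Pr(token)={p:.3f}  exact-match={em:.6f}")
```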

Understanding and intelligence

NLP researchers were evenly split when asked, in a 2022 survey, whether (untuned) LLMs "could (ever) understand natural language in some nontrivial sense".[12] Proponents of "LLM understanding" believe that some LLM abilities, such as mathematical reasoning, imply an ability to "understand" certain concepts. A Microsoft team argued in 2023 that GPT-4 "can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more" and that GPT-4 "could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence system": "Can one reasonably say that a system that passes exams for software engineering candidates is not really intelligent?"[13][14] Some researchers characterize LLMs as "alien intelligence".[15][16] For example, Conjecture CEO Connor Leahy considers untuned LLMs to be like inscrutable alien "Shoggoths", and believes that RLHF tuning creates a "smiling facade" obscuring the inner workings of the LLM: "If you don't push it too far, the smiley face stays on. But then you give it [an unexpected] prompt, and suddenly you see this massive underbelly of insanity, of weird thought processes and clearly non-human understanding."[17][18]

In contrast, some proponents of the "LLMs lack understanding" school believe that existing LLMs are "simply remixing and recombining existing writing"[16] or point to the deficits existing LLMs continue to have in prediction skills, reasoning skills, agency, and explainability.[12] For example, GPT-4 has natural deficits in planning and in real-time learning.[14] Generative LLMs have been observed to confidently assert claims of fact which do not seem to be justified by their training data, a phenomenon which has been termed "hallucination".[19] Neuroscientist Terrence Sejnowski has argued that "The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate".[12]

Impact

In 2023, Nature Biomedical Engineering wrote that "it is no longer possible to accurately distinguish" human-written text from text created by large language models, and that "It is all but certain that general-purpose large language models will rapidly proliferate... It is a rather safe bet that they will change many industries over time."[20] Goldman Sachs suggested in 2023 that generative language AI could increase global GDP by 7% in the next ten years, and could expose to automation 300 million jobs globally.[21][22] Some commenters expressed concern over accidental or deliberate creation of misinformation, or other forms of misuse.[23] For example, the availability of large language models could reduce the skill-level required to commit bioterrorism; biosecurity researcher Kevin Esvelt has suggested that LLM creators should exclude from their training data papers on creating or enhancing pathogens.[24]

Architecture

Large language models have most commonly used the transformer architecture, which, since 2018, has become the standard deep learning technique for sequential data (previously, recurrent architectures such as the LSTM were most common).[2]

Tokenization

LLMs are mathematical functions whose input and output are lists of numbers. Consequently, words must be converted to numbers.

In general, an LLM uses a separate tokenizer. A tokenizer maps between texts and lists of integers. The tokenizer is generally adapted to the entire training dataset first, then frozen, before the LLM is trained. A common choice is byte pair encoding.

Another function of tokenizers is text compression, which saves compute. Common words or phrases like "where is" can be encoded into one token, instead of 8 characters. The OpenAI GPT series uses a tokenizer where 1 token maps to around 4 characters, or around 0.75 words, in common English text.[25] Uncommon English text is less predictable, thus less compressible, and thus requires more tokens to encode.
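
This compression is easy to observe with OpenAI's open-source tiktoken library (requires `pip install tiktoken`; the sample strings here are arbitrary):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by recent OpenAI models
for text in ["where is", "Antidisestablishmentarianism", "zxqj vprt"]:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(text)} characters -> {len(tokens)} tokens")
```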

Tokenizers cannot output arbitrary integers. They generally output only integers in the range \{0, 1, 2, ..., V-1\}, where V is called the vocabulary size.

Some tokenizers are capable of handling arbitrary text (generally by operating directly on Unicode), but some are not. When encountering un-encodable text, a tokenizer outputs a special token (often 0) that represents "unknown text". This is often written as [UNK], as in the BERT paper.

Another special token commonly used is [PAD] (often 1), for "padding". This is used because LLMs are generally used on batches of text at one time, and these texts do not encode to the same length. Since LLMs generally require input to be an array that is not jagged, the shorter encoded texts must be padded until they match the length of the longest one.
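
A toy fixed-vocabulary tokenizer makes the [UNK] and [PAD] conventions concrete (the vocabulary below is invented for illustration; real tokenizers use subword schemes such as byte pair encoding):

```python
UNK, PAD = 0, 1
vocab = {"[UNK]": UNK, "[PAD]": PAD, "i": 2, "like": 3, "ice": 4, "cream": 5}

def encode(text):
    return [vocab.get(word, UNK) for word in text.lower().split()]

def pad_batch(texts):
    encoded = [encode(t) for t in texts]
    width = max(len(e) for e in encoded)              # length of the longest text
    return [e + [PAD] * (width - len(e)) for e in encoded]

print(pad_batch(["I like ice cream", "I like pistachio"]))
# [[2, 3, 4, 5], [2, 3, 0, 1]]  -- "pistachio" maps to [UNK], then [PAD] fills
```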

Output

The output of a LLM is a probability distribution over its vocabulary. This is usually implemented as follows:

  • Upon receiving a text, the bulk of the LLM outputs a vector y\in \R^V where V is its vocabulary size (defined above).
  • The vector y is passed through a softmax function to obtain \textit{softmax}(y).

In the process, the vector y is usually called the unnormalized logit vector, and the vector \textit{softmax}(y) is called the probability vector. Since the vector \textit{softmax}(y) has V entries, all non-negative, and they sum to 1, we can interpret it as a probability distribution over \{0, 1, 2, ..., V-1\}—that is, it is a probability distribution over the LLM's vocabulary.

Note that the softmax function is defined mathematically with no parameters to vary. Consequently it is not trained.
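
A minimal sketch of this final step (the logits are chosen arbitrarily; subtracting the maximum is the usual trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(y):
    """Convert an unnormalized logit vector into a probability vector."""
    z = np.exp(y - np.max(y))
    return z / z.sum()

logits = np.array([2.0, 1.0, -1.0, 0.5])   # toy logits over a 4-token vocabulary
probs = softmax(logits)
print(probs, probs.sum())                  # non-negative entries summing to 1
```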

Training

Most LLMs are pre-trained such that, given a training dataset of text tokens, the model predicts the tokens in the dataset. There are two general styles of such pretraining:[26]

  • autoregressive (GPT-style, "predict the next word"): Given a segment of text like "I like to eat" the model predicts the next tokens, like "ice cream".
  • masked ("BERT-style",[27] "cloze test"): Given a segment of text like "I like to [MASK] [MASK] cream" the model predicts the masked tokens, like "eat ice".

LLMs may be trained on auxiliary tasks which test their understanding of the data distribution, such as Next Sentence Prediction (NSP), in which pairs of sentences are presented and the model must predict whether they appear consecutively in the training corpus.[27]

Usually, LLMs are trained to minimize a specific loss function: the average negative log likelihood per token (also called cross-entropy loss).[citation needed] For example, if an autoregressive model, given "I like to eat", predicts a probability distribution \Pr(\cdot \mid \text{I like to eat}), then the negative log likelihood loss on this token is -\log \Pr(\text{ice} \mid \text{I like to eat}).
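
Computed over a text, the loss is just the mean of these per-token terms (toy probabilities below):

```python
import numpy as np

# Probabilities a model assigned to each actual next token in some text.
probs_of_actual_tokens = np.array([0.9, 0.25, 0.6, 0.05])   # illustrative values
nll_per_token = -np.log(probs_of_actual_tokens)
print(nll_per_token.mean())   # average negative log likelihood (nats/token)
```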

During training, a regularization loss is also used to stabilize training. However, regularization loss is usually not used during testing and evaluation. There are also many more evaluation criteria than just negative log likelihood; see the section below for details.

Training dataset size

The earliest LLMs were trained on corpora having on the order of billions of words.

GPT-1, the first model in OpenAI's numbered series of generative pre-trained transformer models, was trained in 2018 on BookCorpus, consisting of 985 million words.[28] In the same year, BERT was trained on a combination of BookCorpus and English Wikipedia, totalling 3.3 billion words.[27] Since then, training corpora for LLMs have increased by orders of magnitude, reaching up to trillions of tokens.[27]

Training cost

LLMs are computationally expensive to train. A 2020 study estimated the cost of training a 1.5 billion parameter model (2 orders of magnitude smaller than the state of the art at the time) at $1.6 million.[29] Advances in software and hardware have brought the cost substantially down, with a 2023 paper reporting a cost of 72,300 A100-GPU-hours to train a 12 billion parameter model.[30]

For Transformer-based LLMs, training cost is much higher than inference cost. It costs 6 FLOPs per parameter to train on one token, whereas it costs 1 to 2 FLOPs per parameter to infer on one token.[8]
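
These rules of thumb make back-of-the-envelope estimates straightforward; using the GPT-3 figures from the table below:

```python
n_params = 175e9    # GPT-3 parameter count (from the table below)
n_tokens = 300e9    # GPT-3 training tokens (from the table below)

train_flops = 6 * n_params * n_tokens     # ~3.15e23 FLOPs to train
infer_flops = 2 * n_params                # ~3.5e11 FLOPs per token, upper end

print(f"training: {train_flops:.3g} FLOPs, inference: {infer_flops:.3g} FLOPs/token")
```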

Application to downstream tasks

Between 2018 and 2020, the standard method for harnessing an LLM for a specific natural language processing (NLP) task was to fine tune the model with additional task-specific training. It has subsequently been found that more powerful LLMs such as GPT-3 can solve tasks without additional training via "prompting" techniques, in which the problem to be solved is presented to the model as a text prompt, possibly with some textual examples of similar problems and their solutions.[2]

Fine-tuning

Fine-tuning is the practice of modifying an existing pretrained language model by training it (in a supervised fashion) on a specific task (e.g. sentiment analysis, named-entity recognition, or part-of-speech tagging). It is a form of transfer learning. It generally involves the introduction of a new set of weights connecting the final layer of the language model to the output of the downstream task. The original weights of the language model may be "frozen", such that only the new layer of weights connecting them to the output are learned during training. Alternatively, the original weights may receive small updates (possibly with earlier layers frozen).[27]
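
A minimal PyTorch-style sketch of the frozen-weights variant (the backbone below is a stand-in for a real pretrained model, and the dimensions are assumptions):

```python
import torch
import torch.nn as nn

hidden_size, num_labels = 768, 2              # assumed dimensions

backbone = nn.TransformerEncoder(             # stands in for a pretrained LM
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12), num_layers=12)

for param in backbone.parameters():           # freeze the original weights
    param.requires_grad = False

classifier = nn.Linear(hidden_size, num_labels)   # new task-specific layer

# Only the new head's weights are updated during fine-tuning.
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)
```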

Prompting

In the prompting paradigm, popularized by GPT-3,[3] the problem to be solved is formulated via a text prompt, which the model must solve by providing a completion (via inference). In "few-shot prompting", the prompt includes a small number of examples of similar (problem, solution) pairs.[2] For example, a sentiment analysis task of labelling the sentiment of a movie review could be prompted as follows:[3]

Review: This movie stinks.
Sentiment: negative

Review: This movie is fantastic!
Sentiment:

If the model outputs "positive", then it has correctly solved the task. In zero-shot prompting, no solved examples are provided.[29][31] An example of a zero-shot prompt for the same sentiment analysis task would be "The sentiment associated with the movie review 'This movie is fantastic!' is".[32]

Few-shot performance of LLMs has been shown to achieve competitive results on NLP tasks, sometimes surpassing prior state-of-the-art fine-tuning approaches. Examples of such NLP tasks are translation, question answering, cloze tasks, unscrambling words, and using a novel word in a sentence.[31] The creation and optimisation of such prompts is called prompt engineering.

Instruction tuning

Instruction tuning is a form of fine-tuning designed to facilitate more natural and accurate zero-shot prompting interactions. Given a text input, a pretrained language model will generate a completion which matches the distribution of text on which it was trained. A naive language model given the prompt "Write an essay about the main themes of Hamlet." might provide a completion such as "A late penalty of 10% per day will be applied to submissions received after March 17." In instruction tuning, the language model is trained on many examples of tasks formulated as natural language instructions, along with appropriate responses.

Various techniques for instruction tuning have been applied in practice. One example, "self-instruct", fine-tunes the language model on a training set of examples which are themselves generated by an LLM (bootstrapped from a small initial set of human-generated examples).[33]
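
The training data for instruction tuning is typically a list of (instruction, response) pairs, shaped roughly as follows (both examples invented for illustration):

```python
instruction_data = [
    {"instruction": "Write an essay about the main themes of Hamlet.",
     "response": "Hamlet explores revenge, mortality, and indecision..."},
    {"instruction": "Translate 'good morning' into French.",
     "response": "Bonjour."},
]
# Fine-tuning then maximizes the likelihood of each response given its instruction.
```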

Finetuning by reinforcement learning

OpenAI's InstructGPT protocol involves supervised fine-tuning on a dataset of human-generated (prompt, response) pairs, followed by reinforcement learning from human feedback (RLHF), in which a reward model is learned by supervised training on a dataset of human preferences, and this reward model is then used to train the LLM itself by proximal policy optimization.[34]
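
The reward model is commonly trained with a pairwise preference loss: the reward of the human-preferred response should exceed that of the rejected one. A sketch with placeholder scores (the loss form is standard, the numbers are invented):

```python
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.3, 0.2])    # r(prompt, preferred response)
reward_rejected = torch.tensor([0.4, 0.9])  # r(prompt, rejected response)

# Maximize the log-probability that the preferred response is ranked higher.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)
```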

Tool use

Some problems are difficult or impossible for an LLM to solve alone. For example, a model would have difficulty predicting the next tokens for "354 * 139 = ", and cannot predict at all for "What is the time now? It is ". However, just as a person may use a calculator to perform arithmetic and a watch to find the time, an LLM may call other programs to predict the next tokens. The LLM can continue with code, as in "What is the time now? It is {system.time()}" and "354 * 139 = {354 * 139}", and then a separate program interpreter executes the generated code and fills in the output.[35][36] This basic strategy can be made more sophisticated with multiple attempts at generated programs, and other sampling strategies.[37]
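
A toy interpreter for the pattern just described: the model emits text containing {expression} slots, and a separate program evaluates them (illustrative only; running eval() on untrusted model output is unsafe in practice):

```python
import re

def fill_tool_calls(generated: str) -> str:
    return re.sub(r"\{([^}]+)\}", lambda m: str(eval(m.group(1))), generated)

print(fill_tool_calls("354 * 139 = {354 * 139}"))   # -> 354 * 139 = 49206
```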

Generally, to get an LLM to use tools, one must fine-tune it for tool use. If the number of tools is finite, fine-tuning may be done just once. If the number of tools can grow arbitrarily, as with online API services, the LLM can be fine-tuned to read API documentation and call APIs correctly.[38]

A simpler form of tool use is retrieval-augmented generation: augmenting an LLM with document retrieval, sometimes using a vector database. Given a query, a document retriever is called to retrieve the most relevant documents (relevance is usually measured by first encoding the query and the documents into vectors, then finding the documents whose vectors are closest in Euclidean norm to the query vector). The LLM then generates an output based on both the query and the retrieved documents.[39]
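
A minimal sketch of that retrieval step (the embedding vectors below are random stand-ins; a real system would produce them with a text-embedding model):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    dists = np.linalg.norm(doc_vecs - query_vec, axis=1)   # Euclidean distances
    return [docs[i] for i in np.argsort(dists)[:k]]

def rag_prompt(query, retrieved_docs):
    return "Context:\n" + "\n".join(retrieved_docs) + f"\n\nQuestion: {query}\nAnswer:"

docs = ["The Sharks lost the 2016 Stanley Cup finals.",
        "Othello is a board game.",
        "BERT is an encoder-only model."]
doc_vecs = np.random.randn(len(docs), 8)                   # stand-in embeddings
query_vec = doc_vecs[0] + 0.01 * np.random.randn(8)        # query near document 0
print(rag_prompt("Have the San Jose Sharks won the Stanley Cup?",
                 retrieve(query_vec, doc_vecs, docs)))
```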

Agency

An LLM is a language model, not an agent, as it has no goal; but it can be used as a component of an intelligent agent.

The ReAct ("Reason+Act") method constructs an agent out of an LLM, using the LLM as a planner. The LLM is prompted to "think out loud". Specifically, the language model is prompted with a textual description of the environment, a goal, a list of possible actions, and a record of the actions and observations so far. It generates one or more thoughts before generating an action, which is then executed in the environment.[40] The linguistic description of the environment given to the LLM planner can even be the LaTeX code of a paper describing the environment.[41]
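
A skeleton of such a loop (llm() and execute() are placeholders for a real model call and a real environment; the parsing is deliberately simplified):

```python
def react_agent(goal, actions, llm, execute, max_steps=10):
    transcript = f"Goal: {goal}\nPossible actions: {', '.join(actions)}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")       # the model "thinks out loud"
        transcript += "Thought:" + step + "\n"
        if "Action:" in step:
            action = step.split("Action:", 1)[1].strip()
            if action == "finish":
                break
            observation = execute(action)         # run the action in the environment
            transcript += f"Observation: {observation}\n"
    return transcript
```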

Monte Carlo tree search can use an LLM as a rollout heuristic. When a programmatic world model is not available, an LLM can also be prompted with a description of the environment to act as a world model.[42]

For open-ended exploration, an LLM can be used to score observations for their "interestingness", which can be used as a reward signal to guide a normal (non-LLM) reinforcement learning agent.[43] Alternatively, it can propose increasingly difficult tasks for curriculum learning.[44] Instead of outputting individual actions, an LLM planner can also construct "skills", or functions for complex action sequences. The skills can be stored and later invoked, allowing increasing levels of abstraction in planning.[44]

Compression

Typically, LLMs are trained with full- or half-precision floating point numbers (float32 and float16). One float16 has 16 bits, or 2 bytes, so one billion parameters require 2 gigabytes. The largest models typically have 100 billion parameters, requiring 200 gigabytes to load, which places them outside the range of most consumer electronics.

Post-training quantization[45] aims to decrease the space requirement by converting the parameters of a trained model into less precision, while preserving most of its performance.[46][47] The simplest form of quantization simply truncates all lower-bit precisions. It can be improved by using a different quantization codebook per layer. Further improvement can be done by applying different precisions to different parameters, with higher precision for particularly important parameters ("outlier weights").[48]

While quantized models are typically frozen, and only pre-quantized models are finetuned, quantized models can still be finetuned.[49]
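
The simplest scheme can be sketched in a few lines: store 8-bit integers plus one floating-point scale per tensor, and reconstruct weights on the fly (a minimal symmetric-quantization sketch, not any particular paper's method):

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0                    # one scale per tensor
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - dequantize(q, s)).max())              # small reconstruction error
```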

Evaluation

Perplexity

The most commonly used measure of a language model's performance is its perplexity on a given text corpus. Perplexity measures how well a model is able to predict the contents of a dataset; the higher the likelihood the model assigns to the dataset, the lower the perplexity. Mathematically, perplexity is defined as the exponential of the average negative log likelihood per token:

\log(\text{Perplexity}) = -\frac{1}{N} \sum_{i=1}^N \log \Pr(\text{token}_i \mid \text{context for token}_i)

where N is the number of tokens in the text corpus, and "context for token i" depends on the specific type of LLM used. If the LLM is autoregressive, then "context for token i" is the segment of text appearing before token i. If the LLM is masked, then "context for token i" is the segment of text surrounding token i.
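
Given the probabilities a model assigned to each token, the computation is direct (toy values):

```python
import numpy as np

token_probs = np.array([0.2, 0.5, 0.9, 0.1])            # illustrative values
perplexity = np.exp(-np.mean(np.log(token_probs)))      # exp of avg NLL per token
print(perplexity)
```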

Because language models may overfit to their training data, models are usually evaluated by their perplexity on a test set of unseen data.[27] This presents particular challenges for the evaluation of large language models. As they are trained on increasingly large corpora of text largely scraped from the web, it becomes increasingly likely that models' training data inadvertently includes portions of any given test set.[31]

Task-specific datasets and benchmarks

A large number of testing datasets and benchmarks have also been developed to evaluate the capabilities of language models on more specific downstream tasks. Tests may be designed to evaluate a variety of capabilities, including general knowledge, commonsense reasoning, and mathematical problem-solving.

One broad category of evaluation dataset is question answering datasets, consisting of pairs of questions and correct answers, for example, ("Have the San Jose Sharks won the Stanley Cup?", "No").[50] A question answering task is considered "open book" if the model's prompt includes text from which the expected answer can be derived (for example, the previous question could be adjoined with some text which includes the sentence "The Sharks have advanced to the Stanley Cup finals once, losing to the Pittsburgh Penguins in 2016."[50]). Otherwise, the task is considered "closed book", and the model must draw on knowledge retained during training.[51] Some examples of commonly used question answering datasets include TruthfulQA, Web Questions, TriviaQA, and SQuAD.[51]

Evaluation datasets may also take the form of text completion, having the model select the most likely word or sentence to complete a prompt, for example: "Alice was friends with Bob. Alice went to visit her friend, ____".[31]

Some composite benchmarks have also been developed which combine a diversity of different evaluation datasets and tasks. Examples include GLUE, SuperGLUE, MMLU, BIG-bench, and HELM.[52][51]

It was previously standard to report results on a heldout portion of an evaluation dataset after doing supervised fine-tuning on the remainder. It is now more common to evaluate a pre-trained model directly through prompting techniques, though researchers vary in the details of how they formulate prompts for particular tasks, particularly with respect to how many examples of solved tasks are adjoined to the prompt (i.e. the value of n in n-shot prompting).

Adversarially constructed evaluations

Because of the rapid pace of improvement of large language models, evaluation benchmarks have suffered from short lifespans: state-of-the-art models quickly "saturate" existing benchmarks, exceeding the performance of human annotators, which has led to efforts to replace or augment benchmarks with more challenging tasks.[53] In addition, there are cases of "shortcut learning" wherein AIs sometimes "cheat" on multiple-choice tests by using statistical correlations in superficial test question wording in order to guess the correct responses, without necessarily understanding the actual question being asked.[12]

Some datasets have been constructed adversarially, focusing on particular problems on which extant language models seem to have unusually poor performance compared to humans. One example is the TruthfulQA dataset, a question answering dataset consisting of 817 questions which language models are susceptible to answering incorrectly by mimicking falsehoods to which they were repeatedly exposed during training. For example, an LLM may answer "No" to the question "Can you teach an old dog new tricks?" because of its exposure to the English idiom you can't teach an old dog new tricks, even though this is not literally true.[54]

Another example of an adversarial evaluation dataset is SWAG and its successor, HellaSwag, collections of problems in which one of multiple options must be selected to complete a text passage. The incorrect completions were generated by sampling from a language model and filtering with a set of classifiers. The resulting problems are trivial for humans but, at the time the datasets were created, state-of-the-art language models had poor accuracy on them. For example:

We see a fitness center sign. We then see a man talking to the camera and sitting and laying on a exercise ball. The man...
a) demonstrates how to increase efficient exercise work by running up and down balls.
b) moves all his arms and legs and builds up a lot of muscle.
c) then plays the ball and we see a graphics and hedge trimming demonstration.

d) performs sits ups while on the ball and talking.[55]

BERT selects b) as the most likely completion, though the correct answer is d).[55]

Interpretation

Large language models by themselves are "black boxes", and it is not clear how they can perform linguistic tasks. There are several methods for understanding how LLMs work.

Mechanistic interpretability aims to reverse-engineer LLMs by discovering symbolic algorithms that approximate the inference performed by an LLM. One example is Othello-GPT, where a small Transformer is trained to predict legal Othello moves. It is found that there is a linear representation of the Othello board, and modifying the representation changes the predicted legal Othello moves in the correct way.[56][57] In another example, the authors trained small transformers on modular arithmetic addition. The resulting models were reverse-engineered, and it turned out they use the discrete Fourier transform.[58]

In another example, a small Transformer is trained on Karel programs. Similar to the Othello-GPT example, there is a linear representation of Karel program semantics, and modifying the representation changes output in the correct way. The model also generates correct programs that are on average shorter than those in the training set.[59]
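
Such "linear representation" claims are usually tested with a linear probe: a linear classifier fit from hidden activations to a world-state feature. The sketch below uses synthetic stand-ins (random activations with a planted linear feature) rather than activations from a real model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
direction = rng.normal(size=512)                     # hypothetical feature direction
activations = rng.normal(size=(1000, 512))           # stand-in hidden states
labels = (activations @ direction > 0).astype(int)   # a linearly encoded state bit

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print(probe.score(activations, labels))              # high accuracy => linearly decodable
```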

List of large language models

Name Release date[lower-alpha 1] Developer Number of parameters[lower-alpha 2] Corpus size License[lower-alpha 3] Notes
BERT 2018 Google 340 million[60] 3.3 billion words[60] Apache 2.0[61] An early and influential language model,[2] but encoder-only and thus not built to be prompted or generative[62]
XLNet 2019 Google ~340 million[63] 33 billion words An alternative to BERT; designed as encoder-only[64][65]
GPT-2 2019 OpenAI 1.5 billion[66] 40GB[67] (~10 billion tokens)[68] MIT[69] general-purpose model based on transformer architecture
GPT-3 2020 OpenAI 175 billion[29] 300 billion tokens[68] Proprietary (public web API) A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022.[70]
GPT-Neo March 2021 EleutherAI 2.7 billion[71] 825 GiB[72] MIT[73] The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3.[73]
GPT-J June 2021 EleutherAI 6 billion[74] 825 GiB[72] Apache 2.0 GPT-3-style language model
Megatron-Turing NLG October 2021[75] Microsoft and Nvidia 530 billion[76] 338.6 billion tokens[76] Restricted web access Standard architecture but trained on a supercomputing cluster.
Ernie 3.0 Titan December 2021 Baidu 260 billion[77] 4 TB Proprietary Chinese-language LLM. Ernie Bot is based on this model.
Claude[78] December 2021 Anthropic 52 billion[79] 400 billion tokens[79] Proprietary (closed beta) Fine-tuned for desirable behavior in conversations.[80]
GLaM (Generalist Language Model) December 2021 Google 1.2 trillion[81] 1.6 trillion tokens[81] Proprietary Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3.
Gopher December 2021 DeepMind 280 billion[82] 300 billion tokens[83] Proprietary
LaMDA (Language Models for Dialog Applications) January 2022 Google 137 billion[84] 1.56T words,[84] 168 billion tokens[83] Proprietary Specialized for response generation in conversations.
GPT-NeoX February 2022 EleutherAI 20 billion[85] 825 GiB[72] Apache 2.0 based on the Megatron architecture
Chinchilla March 2022 DeepMind 70 billion[86] 1.4 trillion tokens[86][83] Proprietary Reduced-parameter model trained on more data. Used in the Sparrow bot.
PaLM (Pathways Language Model) April 2022 Google 540 billion[87] 768 billion tokens[86] Proprietary aimed to reach the practical limits of model scale
OPT (Open Pretrained Transformer) May 2022 Meta 175 billion[88] 180 billion tokens[89] Non-commercial research[lower-alpha 4] GPT-3 architecture with some adaptations from Megatron
YaLM 100B June 2022 Yandex 100 billion[90] 1.7TB[90] Apache 2.0 English-Russian model based on Microsoft's Megatron-LM.
Minerva June 2022 Google 540 billion[91] 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server[91] Proprietary LLM trained for solving "mathematical and scientific questions using step-by-step reasoning".[92] Minerva is based on PaLM model, further trained on mathematical and scientific data.
BLOOM July 2022 Large collaboration led by Hugging Face 175 billion[93] 350 billion tokens (1.6TB)[94] Responsible AI Essentially GPT-3 but trained on a multi-lingual corpus (30% English excluding programming languages)
Galactica November 2022 Meta 120 billion 106 billion tokens[95] CC BY-NC 4.0 Trained on scientific text and modalities.
AlexaTM (Teacher Models) November 2022 Amazon 20 billion[96] 1.3 trillion[97] Restricted web access[98] bidirectional sequence-to-sequence architecture
LLaMA (Large Language Model Meta AI) February 2023 Meta 65 billion[99] 1.4 trillion[99] Non-commercial research[lower-alpha 5] Trained on a large 20-language corpus to aim for better performance with fewer parameters.[99] Researchers from Stanford University trained a fine-tuned model based on LLaMA weights, called Alpaca.[100]
GPT-4 March 2023 OpenAI Exact number unknown, approximately 1 trillion[lower-alpha 6] Unknown Proprietary (public web API) Available for ChatGPT Plus users and used in several products.
Cerebras-GPT March 2023 Cerebras 13 billion[102] Apache 2.0 Trained with Chinchilla formula.
Falcon March 2023 Technology Innovation Institute 40 billion[103] 1 trillion tokens (1TB)[103] Apache 2.0[104] The model is claimed to use only 75% of GPT-3's training compute, 40% of Chinchilla's, and 80% of PaLM-62B's.
BloombergGPT March 2023 Bloomberg L.P. 50 billion 363 billion token dataset based on Bloomberg's data sources, plus 345 billion tokens from general purpose datasets[105] Proprietary LLM trained on financial data from proprietary sources, that "outperforms existing models on financial tasks by significant margins without sacrificing performance on general LLM benchmarks"
PanGu-Σ March 2023 Huawei 1.085 trillion 329 billion tokens[106] Proprietary
OpenAssistant[107] March 2023 LAION 17 billion 1.5 trillion tokens Apache 2.0 Trained on crowdsourced open data
PaLM 2 (Pathways Language Model 2) May 2023 Google 340 billion[108] 3.6 trillion tokens[108] Proprietary Used in Bard chatbot.[109]

Notes

  1. This is the date that documentation describing the model's architecture was first released.
  2. In many cases, researchers release or report on multiple versions of a model having different sizes. In these cases, the size of the largest model is listed here.
  3. This is the license of the pre-trained model weights. In almost all cases the training code itself is open-source or can be easily replicated.
  4. The smaller models including 66B are publicly available, while the 175B model is available on request.
  5. Facebook's license and distribution scheme restricted access to approved researchers, but the model weights were leaked and became widely available.
  6. As stated in Technical report: "Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method ..."[101] Approximate number in the comparison chart that compares the relative storage, from the same report.

References

  1. Lua error in package.lua at line 80: module 'strict' not found.
  2. 2.0 2.1 2.2 2.3 2.4 2.5 2.6 Lua error in package.lua at line 80: module 'strict' not found.
  3. 3.0 3.1 3.2 3.3 3.4 3.5 Lua error in package.lua at line 80: module 'strict' not found.
  4. 4.0 4.1 Lua error in package.lua at line 80: module 'strict' not found.
  5. Lua error in package.lua at line 80: module 'strict' not found.
  6. Lua error in package.lua at line 80: module 'strict' not found.
  7. Lua error in package.lua at line 80: module 'strict' not found.
  8. 8.0 8.1 Lua error in package.lua at line 80: module 'strict' not found.
  9. Lua error in package.lua at line 80: module 'strict' not found.
  10. Lua error in package.lua at line 80: module 'strict' not found.
  11. Lua error in package.lua at line 80: module 'strict' not found.
  12. 12.0 12.1 12.2 12.3 Lua error in package.lua at line 80: module 'strict' not found.
  13. Lua error in package.lua at line 80: module 'strict' not found.
  14. 14.0 14.1 Lua error in package.lua at line 80: module 'strict' not found.
  15. Lua error in package.lua at line 80: module 'strict' not found.
  16. 16.0 16.1 Lua error in package.lua at line 80: module 'strict' not found.
  17. Lua error in package.lua at line 80: module 'strict' not found.
  18. Lua error in package.lua at line 80: module 'strict' not found.
  19. Lua error in package.lua at line 80: module 'strict' not found.
  20. Lua error in package.lua at line 80: module 'strict' not found.
  21. Lua error in package.lua at line 80: module 'strict' not found.
  22. Lua error in package.lua at line 80: module 'strict' not found.
  23. Lua error in package.lua at line 80: module 'strict' not found.
  24. Lua error in package.lua at line 80: module 'strict' not found.
  25. Lua error in package.lua at line 80: module 'strict' not found.
  26. Lua error in package.lua at line 80: module 'strict' not found.
  27. 27.0 27.1 27.2 27.3 27.4 27.5 Lua error in package.lua at line 80: module 'strict' not found.
  28. Lua error in package.lua at line 80: module 'strict' not found.
  29. 29.0 29.1 29.2 Lua error in package.lua at line 80: module 'strict' not found.
  30. Lua error in package.lua at line 80: module 'strict' not found.
  31. 31.0 31.1 31.2 31.3 Lua error in package.lua at line 80: module 'strict' not found.
  32. Lua error in package.lua at line 80: module 'strict' not found.
  33. Lua error in package.lua at line 80: module 'strict' not found.
  34. Lua error in package.lua at line 80: module 'strict' not found.
  35. Lua error in package.lua at line 80: module 'strict' not found.
  36. Lua error in package.lua at line 80: module 'strict' not found.
  37. Lua error in package.lua at line 80: module 'strict' not found.
  38. Lua error in package.lua at line 80: module 'strict' not found.
  39. Lua error in package.lua at line 80: module 'strict' not found.
  40. Lua error in package.lua at line 80: module 'strict' not found.
  41. Lua error in package.lua at line 80: module 'strict' not found.
  42. Lua error in package.lua at line 80: module 'strict' not found.
  43. Lua error in package.lua at line 80: module 'strict' not found.
  44. 44.0 44.1 Lua error in package.lua at line 80: module 'strict' not found.
  45. Lua error in package.lua at line 80: module 'strict' not found.
  46. Lua error in package.lua at line 80: module 'strict' not found.
  47. Lua error in package.lua at line 80: module 'strict' not found.
  48. Lua error in package.lua at line 80: module 'strict' not found.
  49. Lua error in package.lua at line 80: module 'strict' not found.
  50. 50.0 50.1 Lua error in package.lua at line 80: module 'strict' not found.
  51. 51.0 51.1 51.2 Lua error in package.lua at line 80: module 'strict' not found.
  52. Lua error in package.lua at line 80: module 'strict' not found.
  53. Lua error in package.lua at line 80: module 'strict' not found.
  54. Lua error in package.lua at line 80: module 'strict' not found.
  55. 55.0 55.1 Lua error in package.lua at line 80: module 'strict' not found.
  56. Lua error in package.lua at line 80: module 'strict' not found.
  57. Lua error in package.lua at line 80: module 'strict' not found.
  58. Lua error in package.lua at line 80: module 'strict' not found.
  59. Lua error in package.lua at line 80: module 'strict' not found.
  60. 60.0 60.1 Lua error in package.lua at line 80: module 'strict' not found.
  61. Lua error in package.lua at line 80: module 'strict' not found.
  62. Lua error in package.lua at line 80: module 'strict' not found.
  63. Lua error in package.lua at line 80: module 'strict' not found.
  64. Lua error in package.lua at line 80: module 'strict' not found.
  65. Lua error in package.lua at line 80: module 'strict' not found.
  66. Lua error in package.lua at line 80: module 'strict' not found.
  67. Lua error in package.lua at line 80: module 'strict' not found.
  68. 68.0 68.1 Lua error in package.lua at line 80: module 'strict' not found.
  69. Lua error in package.lua at line 80: module 'strict' not found.
  70. Lua error in package.lua at line 80: module 'strict' not found.
  71. Lua error in package.lua at line 80: module 'strict' not found.
  72. 72.0 72.1 72.2 Lua error in package.lua at line 80: module 'strict' not found.
  73. 73.0 73.1 Lua error in package.lua at line 80: module 'strict' not found.
  74. Lua error in package.lua at line 80: module 'strict' not found.
  75. Lua error in package.lua at line 80: module 'strict' not found.
  76. 76.0 76.1 Lua error in package.lua at line 80: module 'strict' not found.
  77. Lua error in package.lua at line 80: module 'strict' not found.
  78. Lua error in package.lua at line 80: module 'strict' not found.
  79. 79.0 79.1 Lua error in package.lua at line 80: module 'strict' not found.
  80. Lua error in package.lua at line 80: module 'strict' not found.
  81. 81.0 81.1 Lua error in package.lua at line 80: module 'strict' not found.
  82. Lua error in package.lua at line 80: module 'strict' not found.
  83. 83.0 83.1 83.2 Lua error in package.lua at line 80: module 'strict' not found.
  84. 84.0 84.1 Lua error in package.lua at line 80: module 'strict' not found.
  85. Lua error in package.lua at line 80: module 'strict' not found.
  86. 86.0 86.1 86.2 Lua error in package.lua at line 80: module 'strict' not found.
  87. Lua error in package.lua at line 80: module 'strict' not found.
  88. Lua error in package.lua at line 80: module 'strict' not found.
  89. Lua error in package.lua at line 80: module 'strict' not found.
  90. 90.0 90.1 Lua error in package.lua at line 80: module 'strict' not found.
  91. 91.0 91.1 Lua error in package.lua at line 80: module 'strict' not found.
  92. Lua error in package.lua at line 80: module 'strict' not found.
  93. Lua error in package.lua at line 80: module 'strict' not found.
  94. Lua error in package.lua at line 80: module 'strict' not found.
  95. Lua error in package.lua at line 80: module 'strict' not found.
  96. Lua error in package.lua at line 80: module 'strict' not found.
  97. Lua error in package.lua at line 80: module 'strict' not found.
  98. Lua error in package.lua at line 80: module 'strict' not found.
  99. 99.0 99.1 99.2 Lua error in package.lua at line 80: module 'strict' not found.
  100. Lua error in package.lua at line 80: module 'strict' not found.
  101. Lua error in package.lua at line 80: module 'strict' not found.
  102. Lua error in package.lua at line 80: module 'strict' not found.
  103. 103.0 103.1 Lua error in package.lua at line 80: module 'strict' not found.
  104. UAE’s Falcon 40B, World’s Top-Ranked AI Model from Technology Innovation Institute, is Now Royalty-Free, 31 May 2023
  105. Lua error in package.lua at line 80: module 'strict' not found.
  106. Lua error in package.lua at line 80: module 'strict' not found.
  107. Lua error in package.lua at line 80: module 'strict' not found.
  108. 108.0 108.1 Lua error in package.lua at line 80: module 'strict' not found.
  109. Lua error in package.lua at line 80: module 'strict' not found.