Llama (language model)

Llama is a family of large language models (LLMs) released by Meta AI starting in February 2023.

Background

After the release of large language models such as GPT-3, research focused on scaling models up, which in some instances produced major increases in emergent capabilities. The release of ChatGPT and its surprise success increased attention to large language models. In contrast to other responses to ChatGPT, Meta's chief AI scientist Yann LeCun stated that large language models are best suited to aiding with writing.

Versions

Initial release

The first version of Llama (stylized as LLaMA and sometimes referred to as Llama 1) was announced on February 24, 2023, via a blog post and a paper describing the model's training, architecture, and performance.

Leak

On March 3, 2023, a torrent containing Llama's weights was uploaded, with a link to the torrent shared on the 4chan imageboard and subsequently spread through online AI communities. On March 4, a pull request was opened to add links to HuggingFace repositories containing the model. On March 20, Meta filed a DMCA takedown request for copyright infringement against a repository containing a script that downloaded Llama from a mirror, and GitHub complied the next day.

Reactions to the leak varied. Some speculated that the model would be used for malicious purposes, such as more sophisticated spam. Others celebrated the model's accessibility, as well as the fact that smaller versions of the model can be run relatively cheaply, suggesting that this would promote further research.

Llama 2

The model architecture remains largely unchanged from that of Llama 1 models, but 40% more data was used to train the foundational models. Llama 2 includes foundation models and models fine-tuned for chat. In a further departure from the original version of Llama, all models are released with weights and may be used for many commercial use cases. Because Llama's license enforces an acceptable use policy that prohibits Llama from being used for some purposes, it is not open source. Meta's use of the term open-source to describe Llama has been disputed by the Open Source Initiative (which maintains The Open Source Definition) and others.

Code Llama

Code Llama is a fine-tune of Llama 2 with code-specific datasets. 7B, 13B, and 34B versions were released on August 24, 2023, with a 70B version released on January 29, 2024.
Starting with the foundation models from Llama 2, Meta AI trained on an additional 500B tokens of code datasets, followed by an additional 20B tokens of long-context data, creating the Code Llama foundation models. This foundation model was further trained on 5B instruction-following tokens to create the instruct fine-tune. Another foundation model was created for Python code, which was trained on 100B tokens of Python-only code before the long-context data.

Llama 3

On April 18, 2024, Meta released Llama 3 in two sizes: 8B and 70B parameters. The models were pre-trained on approximately 15 trillion tokens of text gathered from "publicly available sources", with the instruct models fine-tuned on "publicly available instruction datasets, as well as over 10M human-annotated examples". Meta AI's testing in April 2024 showed Llama 3 70B beating Gemini Pro 1.5 and Claude 3 Sonnet on most benchmarks. Meta also announced plans to make Llama 3 multilingual and multimodal, better at coding and reasoning, and to increase its context window.

Regarding scaling laws, Llama 3 models showed empirically that when a model is trained on more than the "Chinchilla-optimal" amount of data, performance continues to scale log-linearly. For example, the Chinchilla-optimal dataset for Llama 3 8B is 200 billion tokens, but performance continued to scale log-linearly up to the 75-times larger dataset of 15 trillion tokens.

During an interview with Dwarkesh Patel, Mark Zuckerberg said that the 8B version of Llama 3 was nearly as powerful as the largest Llama 2. Zuckerberg stated that, compared to previous models, the team was surprised that the 70B model was still learning even at the end of training on 15T tokens. The decision was made to end training to focus GPU power elsewhere.

Llama 3.1 was released on July 23, 2024, in three sizes: 8B, 70B, and 405B parameters.

Llama 4

The Llama 4 series was released in 2025.
The architecture was changed to a mixture of experts where only a fraction of the model’s expert sub-networks are activated per input token. They are multimodal (text and image input, text output) and multilingual (12 languages).
• Scout: 17 billion active parameter model with 16 experts, context window of 10M, with 109B parameters in total.
• Maverick: 17 billion active parameter model with 128 experts, context window of 1M, with 400B parameters in total.

The Behemoth model was also announced, but was not released. Meta claimed it was a 288 billion active parameter model with 16 experts and around 2T parameters in total; it was still in training when Scout and Maverick were released. Maverick was codistilled from Behemoth, while Scout was trained from scratch.

The training data included publicly available data, licensed data, and Meta-proprietary data such as publicly shared posts from Instagram and Facebook and people’s interactions with Meta AI. The knowledge cutoff was August 2024.

The company also stated that Llama 4's benchmark score was achieved using an unreleased "experimental chat version" of the model that was "optimized for conversationality", which differed from the version of Llama 4 released to the public. LMArena indicated that it would change its policies to prevent this incident from reoccurring, and responded, "Meta's interpretation of our policy did not match what we expect from model providers. Meta should have made it clearer that 'Llama-4-Maverick-03-26-Experimental' was a customized model to optimize for human preference."

Comparison of models

For the training cost column, only the largest model's cost is written by default. For example, "21,000" is the training cost of Llama 2 69B in units of petaFLOPS-day. Also, 1 petaFLOPS-day = 1 petaFLOP/sec × 1 day = 8.64E19 FLOP. "T" means "trillion" and "B" means "billion".
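The unit conversion above, along with two other figures quoted in this section, can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are invented for this example, and every input is a number stated in the text.

```python
# Illustrative arithmetic only; every input is a figure quoted in the text.

SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds in one day
PETA = 10**15                    # 1 petaFLOP/sec = 1e15 FLOP per second


def petaflops_days_to_flop(pf_days):
    """Convert a training cost in petaFLOPS-days to total floating-point operations."""
    return pf_days * PETA * SECONDS_PER_DAY


# 1 petaFLOPS-day, which the text gives as 8.64E19 FLOP.
one_pf_day = petaflops_days_to_flop(1)

# The "21,000" petaFLOPS-day example from the training cost column.
llama2_largest_flop = petaflops_days_to_flop(21_000)

# Llama 3 8B: 15 trillion training tokens versus a 200 billion token
# Chinchilla-optimal dataset.
overtraining_ratio = (15 * 10**12) / (200 * 10**9)

# Llama 4: share of total parameters that are active per token (in billions).
scout_active_share = 17 / 109
maverick_active_share = 17 / 400

print(f"1 petaFLOPS-day        = {one_pf_day:.3e} FLOP")
print(f"21,000 petaFLOPS-days  = {llama2_largest_flop:.4e} FLOP")
print(f"Llama 3 8B data ratio  = {overtraining_ratio:.0f}x Chinchilla-optimal")
print(f"Scout active share     = {scout_active_share:.1%}")
print(f"Maverick active share  = {maverick_active_share:.2%}")
```

The active-share lines show why a mixture-of-experts model with a very large total parameter count can cost roughly as much per token as a much smaller dense model: under the figures given, only a small fraction of the weights take part in each forward pass.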
The following table lists the main model versions of Llama, describing the significant changes included with each version:

Architecture and training

Architecture

Like GPT-3, the Llama series of models are autoregressive decoder-only transformers, but there are some minor differences:
• Llama uses the SwiGLU activation function instead of GPT-3's GeLU.
• Llama uses rotary positional embeddings (RoPE) instead of absolute positional embedding.
• Instead of layer normalization, Llama uses RMSNorm.

Training datasets

Llama's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters, reasoning that the dominating cost for LLMs is from doing inference on the trained model rather than the computational cost of the training process.

Llama 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources.

Llama 2 foundational models were trained on a data set with 2 trillion tokens. This data set was curated to remove Web sites that often disclose personal data of people. It also upsamples sources considered trustworthy.

Fine-tuning

Llama 1 models are only available as foundational models with self-supervised learning and without fine-tuning.

Llama 2 – Chat models were derived from foundational Llama 2 models. Unlike GPT-4, which increased context length during fine-tuning, Llama 2 and Code Llama - Chat have the same context length of 4K tokens. Supervised fine-tuning used an autoregressive loss function with token loss on user prompts zeroed out. The batch size was 64.

For AI alignment, human annotators wrote prompts and then compared two model outputs (a binary protocol), giving confidence levels and separate safety labels with veto power. Two separate reward models were trained from these preferences for safety and helpfulness using reinforcement learning from human feedback (RLHF). A major technical contribution is the departure from the exclusive use of proximal policy optimization (PPO) for RLHF: a new technique based on rejection sampling was used, followed by PPO.
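The rejection-sampling step mentioned above can be illustrated with a minimal best-of-N sketch: draw several candidate responses for a prompt, score each one with a reward model, and keep the highest-scoring response as a fine-tuning target. This is a toy under stated assumptions rather than Meta's implementation. `sample_responses`, `toy_reward`, and the canned candidate strings are hypothetical stand-ins for a real language model and a trained reward model.

```python
import random


def sample_responses(prompt, n, rng):
    """Hypothetical stand-in for sampling n responses from a language model."""
    canned = [
        "I cannot help with that.",
        "Sure. Here is a short answer.",
        "Sure. Here is a detailed, step-by-step answer to your question.",
        "Maybe.",
    ]
    return [rng.choice(canned) for _ in range(n)]


def toy_reward(prompt, response):
    """Hypothetical reward model: this toy version simply prefers longer responses."""
    return float(len(response))


def best_of_n(prompt, n, reward_fn, rng):
    """Sample n candidates and return the one with the highest reward."""
    candidates = sample_responses(prompt, n, rng)
    scored = [(reward_fn(prompt, c), c) for c in candidates]
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score, candidates


rng = random.Random(0)  # fixed seed so repeated runs pick the same candidates
prompt = "How do I sort a list in Python?"
best, score, candidates = best_of_n(prompt, n=4, reward_fn=toy_reward, rng=rng)

# In a rejection-sampling round, the (prompt, best) pair would be kept as a
# supervised fine-tuning example and the other candidates discarded.
print("selected:", best)
```

In this framing the reward model only ranks samples the current model already produces, which is one way to see how it differs from PPO, where the reward signal drives gradient updates directly.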
Multi-turn consistency in dialogs was targeted for improvement, to make sure that "system messages" (initial instructions, such as "speak in French" and "act like Napoleon") are respected during the dialog. This was accomplished using the new "Ghost attention" technique during training, which concatenates relevant instructions to each new user message but zeros out the loss function for tokens in the prompt (earlier parts of the dialog).
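The two ingredients described above, concatenating the instruction to each user message and zeroing the loss on prompt tokens, can be sketched as a data-preparation step. The code follows only the description given here and is not the published training procedure. The `[USER]` and `[ASSISTANT]` role tags, the whitespace "tokenizer", and both function names are assumptions made for illustration.

```python
def build_masked_example(instruction, turns):
    """
    Build (tokens, loss_mask) for one multi-turn dialog.

    turns: list of (user_message, assistant_message) pairs.
    The instruction is concatenated to every user message, and the loss mask
    is 1 only for tokens of the final assistant reply. Whitespace splitting
    stands in for a real tokenizer.
    """
    tokens, loss_mask = [], []
    last = len(turns) - 1
    for i, (user_msg, assistant_msg) in enumerate(turns):
        prompt_tokens = f"[USER] {instruction} {user_msg} [ASSISTANT]".split()
        reply_tokens = assistant_msg.split()
        tokens += prompt_tokens + reply_tokens
        # Prompt tokens never contribute to the loss.
        loss_mask += [0] * len(prompt_tokens)
        # Earlier replies are context only; the last reply is the training target.
        loss_mask += [1 if i == last else 0] * len(reply_tokens)
    return tokens, loss_mask


def masked_mean_loss(per_token_losses, loss_mask):
    """Average per-token losses over the positions where the mask is 1."""
    kept = [loss for loss, m in zip(per_token_losses, loss_mask) if m == 1]
    return sum(kept) / len(kept)


instruction = "Always answer in French."
turns = [("Hello!", "Bonjour !"), ("How are you?", "Je vais bien.")]
tokens, mask = build_masked_example(instruction, turns)

# Pretend per-token losses: 5.0 on masked positions, 1.0 on the final reply.
fake_losses = [1.0 if m == 1 else 5.0 for m in mask]
print(masked_mean_loss(fake_losses, mask))
```

Because masked positions are dropped before averaging, the large placeholder losses on the prompt and earlier turns have no effect on the result, which mirrors the "zeroed out" behavior in the text.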

Applications

The Stanford University Institute for Human-Centered Artificial Intelligence (HAI) Center for Research on Foundation Models (CRFM) released Alpaca, a training recipe based on the Llama 7B model that uses the "Self-Instruct" method of instruction tuning to acquire capabilities comparable to the OpenAI GPT-3 series text-davinci-003 model at a modest cost. The model files were officially removed on March 21, 2023, over hosting costs and safety concerns, though the code and paper remain online for reference.

Zoom used Meta Llama 2 to create an AI Companion that can summarize meetings, provide helpful presentation tips, and assist with message responses. This AI Companion is powered by multiple models, including Meta Llama 2.

Reuters reported in 2024 that many Chinese foundation models relied on Llama models for their training.

llama.cpp

Software developer Georgi Gerganov released llama.cpp as open-source on March 10, 2023. It is a re-implementation of Llama in C++, allowing systems without a powerful GPU to run the model locally. The llama.cpp project introduced the GGUF file format, a binary format that stores both tensors and metadata. The format focuses on supporting different quantization types, which can reduce memory usage and increase speed at the expense of lower model precision.

llamafile, created by Justine Tunney, is an open-source tool that bundles llama.cpp with the model into a single executable file. Tunney et al. introduced new optimized matrix multiplication kernels for x86 and ARM CPUs, improving prompt evaluation performance for FP16 and 8-bit quantized data types.

Space

Booz Allen Hamilton deployed Meta’s Llama 3.2 model aboard the International Space Station (ISS) National Labs as part of a project called Space Llama. The system runs on Hewlett Packard Enterprise’s Spaceborne Computer‑2 and leverages Booz Allen’s A2E2 (AI for Edge Environments) platform, using NVIDIA CUDA‑accelerated computing.
Space Llama demonstrates how large language models can operate in disconnected, constrained environments such as space, enabling astronauts to retrieve and summarize documents using natural-language queries, even without internet connectivity.

Military

In 2024, researchers from the People's Liberation Army Academy of Military Sciences (top military academy of China) were reported to have developed a military tool using Llama, which Meta Platforms stated was unauthorized due to Llama's license prohibiting the use of the model for military purposes. Meta granted the US government and US military contractors permission to use Llama in November 2024, but continued to prohibit military use by non-US entities.
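Returning to the quantization trade-off noted in the llama.cpp subsection, the exchange of memory for precision can be illustrated with a generic symmetric 8-bit scheme. This sketch is not the GGUF format and does not reproduce any of llama.cpp's quantization types. The function names, the block of eight weights, and the single shared scale are assumptions chosen to keep the example small.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization of one block: returns (scale, ints in [-127, 127])."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(v / scale) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floating-point values from the stored integers."""
    return [scale * q for q in quantized]


# Hypothetical weights, invented for this example.
weights = [0.12, -0.5, 0.33, 0.01, -0.27, 0.49, -0.08, 0.2]
scale, quantized = quantize_block(weights)
restored = dequantize_block(scale, quantized)

# Precision cost: each restored weight is off by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Memory saving: 32-bit floats versus 8-bit integers plus one 32-bit scale.
fp32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights) + 4

print("quantized:", quantized)
print("max error:", max_error)
print("bytes:", fp32_bytes, "->", int8_bytes)
```

Larger blocks spread the cost of the stored scale over more weights, but a single outlier then coarsens the step size for the whole block. That tension is one reason formats offer several quantization types rather than one.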

Licensing

The first version of Llama was released under a non-commercial license to some researchers and entities on a case-by-case basis. The OSI published The Open Source AI Definition (OSAID) in October 2024, which requires open-source AI to be released with details about its training data that Meta does not disclose for Llama. A Meta spokesperson responded to The Verge that the company disagrees with this definition. The Free Software Foundation classified Llama 3.1's license as a nonfree software license in January 2025, criticizing its acceptable use policy, restrictions against users with popular applications, and enforcement of trade regulations outside the user's jurisdiction. In its coverage of Llama 2, Ars Technica initially echoed Meta's use of the term open-source, but later revised its reporting to describe Llama as "source-available", "openly licensed", and "weights available" after the publication recognized that Llama 2's license disallowed entities with over 700 million daily active users from using the LLM and disallowed the LLM's outputs from being used to improve other LLMs. CIO, in November 2024, stated that Llama was not open-source due to its acceptable use policy, a 630-word document that "puts it at odds with the broader open-source movement".

Reception

Wired describes the 8B parameter version of Llama 3 as being "surprisingly capable" given its size. The response to Meta's integration of Llama into Facebook was mixed, with some users confused after Meta AI told a parental group that it had a child. The release of Llama models has sparked significant debates on the benefits and misuse risks of open-weight models. Such models can be fine-tuned to remove safeguards, notably by cyber criminals, until they comply with harmful requests. Some experts contend that future models may facilitate causing damage more than defending against it, for example by making it relatively easy to engineer advanced bioweapons without specialized knowledge. Conversely, open-weight models can be useful for a wide variety of purposes, including for safety research. Open Source Initiative head Stefano Maffulli criticized Meta for describing Llama as open-source, saying that it was causing confusion among users and "polluting" the term.

Source: Wikipedia ↗
