Beginner’s Guide to Llama Models



This guide is for you if you are new to Llama, a free and open-source large language model. It covers the basics and answers common questions.

What is Llama?

Llama (Large Language Model Meta AI) is a family of large language models (LLMs). It is the answer from Meta (formerly Facebook) to ChatGPT.

But the two companies take different paths. ChatGPT is proprietary: you don’t know the code of the model, the training data, or the training method. Llama is open-source software: the model weights and the inference code are publicly available, and the training data and method are described in Meta’s papers.

Llama was the first major open-source large language model, and it gained instant popularity upon release. In addition to being free and open-source, it is small enough to run on a personal computer. The 7-billion and 13-billion parameter models are very usable on a good consumer-grade PC.

How does Llama work?

Llama is an AI model designed to predict the next word. You can think of it as a glorified autocomplete. It is trained on text from the internet and other public datasets. Llama 2 was trained on about 2 trillion tokens (words and pieces of words).
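
To make the autocomplete idea concrete, here is a toy sketch of next-word prediction. It is not how Llama is implemented: Llama uses a large neural network over tokens, while this example only counts which word most often follows another in a tiny made-up corpus. The function names train_bigram and predict_next are hypothetical and chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers


def predict_next(followers, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = followers.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# A tiny made-up corpus; a real model sees trillions of tokens.
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept ."
model = train_bigram(corpus)
print(predict_next(model, "the"))  # the word that most often follows "the"
```

In this corpus "cat" follows "the" more often than any other word, so that is the prediction. A real LLM does the same kind of thing with far more context and a far richer notion of "most probable".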

You may wonder why the Llama model seems to be intelligent: It gives you sensible answers to difficult questions. It can rewrite your essay. It can give you pros and cons of certain things.

The training text was written by humans. In some sense, it is a slice of human thought projected onto a medium. By learning how to complete a sentence, the model also learns an aspect of being human.

Does the Llama model know logic? There are two opposing views. One view is no, because what the model was designed to learn is correlation: it just predicts the most probable next word, nothing more. The other view is yes. Suppose the training text is a murder story, and the model must learn to complete the last sentence, “The murderer is”. To predict the next word accurately, it has no choice but to learn logical deduction.

Why use Llama instead of ChatGPT?

ChatGPT requires zero setup, and a free version is available. So why use Llama? Here are some reasons:

  • Privacy. You can use Llama locally on your own computer. You don’t need to worry about your questions being stored on a company’s server indefinitely.
  • Confidentiality. You may not be able to use ChatGPT for work-related queries because you are bound by a non-disclosure agreement. You don’t have an NDA with OpenAI, after all.
  • Customization. There are many locally finetuned models you can choose from. If you don’t like the answers of a model, you can switch to another one.
  • Train your model. Finally, you have an opportunity to train your own model using techniques such as LoRA.

What can you do with Llama models?

You can use Llama models the same way you use ChatGPT.

  • Chat. Just ask questions about things you want to know.
  • Coding. Ask for a short program to do something in a specific computer language.
  • Outlines. Ask for an outline of a technical topic.
  • Creative writing. Let the model write a story for you.
  • Information extraction. Summarize an essay. Ask specific questions about an essay.
  • Rewrite. Have the model rewrite your paragraph in a different tone and style.


What language does Llama support?

Mostly English. About 90% of the training data is English.

The rest covers other languages, including German, French, Chinese, Spanish, Dutch, Italian, Japanese, Polish, and Portuguese. But don’t count on them.

This means you shouldn’t use Llama for translation tasks.

What computer hardware do I need?

It depends on the model size. The following table shows the VRAM needed to run a GPTQ model on a GPU card.

Model    8-bit    4-bit
7B       10 GB    6 GB
13B      20 GB    10 GB
30B      40 GB    20 GB
70B      80 GB    40 GB

GPU VRAM requirement.

The following table is for GGML models, which run on a Mac or on the CPU under Windows or Linux.

Model    4-bit quantized
7B       4 GB
13B      8 GB
30B      20 GB
70B      39 GB

RAM requirement.
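
As a quick way to read the two tables, the sketch below looks up which model sizes fit in a given amount of memory. The numbers are copied from the tables, with the model sizes taken to be 7B, 13B, 30B, and 70B. The function names are hypothetical, and the figures are rough guides rather than guarantees, since real usage also depends on context length and the software you run.

```python
# Memory needed in GB, copied from the tables in this guide.
# Keys of the inner dicts are the quantization bit widths.
GPTQ_VRAM_GB = {
    "7B": {8: 10, 4: 6},
    "13B": {8: 20, 4: 10},
    "30B": {8: 40, 4: 20},
    "70B": {8: 80, 4: 40},
}
GGML_RAM_GB = {"7B": 4, "13B": 8, "30B": 20, "70B": 39}  # 4-bit quantized


def gptq_models_that_fit(vram_gb, bits=4):
    """List the GPTQ model sizes whose VRAM requirement is within vram_gb."""
    return [name for name, need in GPTQ_VRAM_GB.items() if need[bits] <= vram_gb]


def ggml_models_that_fit(ram_gb):
    """List the 4-bit GGML model sizes whose RAM requirement is within ram_gb."""
    return [name for name, need in GGML_RAM_GB.items() if need <= ram_gb]


print(gptq_models_that_fit(12))          # a GPU with 12 GB of VRAM, 4-bit models
print(gptq_models_that_fit(24, bits=8))  # a GPU with 24 GB of VRAM, 8-bit models
print(ggml_models_that_fit(16))          # a machine with 16 GB of RAM
```

For example, a 12 GB GPU card fits the 4-bit 7B and 13B models but not the 30B model.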

 

What are quantized models?

Quantization is a method to reduce a model’s size by storing its weights with fewer bits while preserving most of the quality. The benefit to you is a smaller file on your hard drive and less RAM needed to run the model.
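
The idea can be illustrated with a minimal sketch. It uses plain round-to-nearest quantization with a single scale factor, which is an assumption made for simplicity: the GPTQ and GGML formats use more elaborate schemes. The function names and the sample weights are made up for illustration.

```python
def quantize(weights, bits=4):
    """Map floats to signed integers of the given bit width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # largest integer level, e.g. 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]


def weight_size_gb(n_params, bits):
    """Storage for the weights alone, ignoring scales and runtime overhead."""
    return n_params * bits / 8 / 1e9


weights = [0.12, -0.70, 0.33, 0.06, -0.41]  # made-up example weights
q, scale = quantize(weights, bits=4)
print(q)                      # small integers that fit in 4 bits each
print(dequantize(q, scale))   # close to, but not exactly, the original weights
print(weight_size_gb(7e9, 16), weight_size_gb(7e9, 4))  # 7B model at 16-bit vs 4-bit
```

By this estimate, the weights of a 7-billion parameter model shrink from about 14 GB at 16 bits to about 3.5 GB at 4 bits. Actual memory use is higher because of overhead at run time.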

What are the different versions of Llama?

Official models

Meta has released two versions of the official models: Llama 1 and Llama 2.

Llama 1

Llama 1 came out in February 2023. The release caused great excitement because it was the first important open-source LLM. It was a big surprise back then, although it now feels like a long time ago. Llama 1 spurred many efforts to fine-tune the model and optimize it to run locally. Running an LLM locally was initially thought to be impossible, but hobbyists solved the problem in a short time.

Llama 2

Although holding great promise, Llama 1 was released with a license that does not allow commercial use. This has limited the adoption of the Llama 1 model.

Llama 2 came out in July 2023. There are some incremental improvements in training and model architecture, but the most significant change is the license terms: Llama 2 is free for commercial use. It is widely expected that this will spark a new round of development, like what happened with Stable Diffusion.

Fine-tuned models

Unlike with ChatGPT, you can make your own Llama model if you are unhappy with its responses. You do that by training it on additional data. This is called fine-tuning.

Here are some popular fine-tuned models.

WizardLM

WizardLM is a family of models fine-tuned on many instruction-following conversations. Its novelty is the use of an LLM to generate the training data automatically.

Download links

Model                     Base model    Download links
WizardLM 7B uncensored    Llama 1       GPTQ, GGML
WizardLM 13B V1.1         Llama 1       GPTQ, GGML
WizardLM 30B V1.0         Llama 1       GPTQ, GGML

WizardLM models

Vicuna

Vicuna is a family of models fine-tuned on ChatGPT conversations.

Model              Base model    Download links
Vicuna 7B v1.3     Llama 1       GPTQ, GGML
Vicuna 13B v1.3    Llama 1       GPTQ, GGML
Vicuna 30B v1.3    Llama 1       GPTQ, GGML

Vicuna models

How to compare the performance of models?

There are so many models to choose from. How do you know which is the best, whatever that means? And how do you compare the Llama models with ChatGPT?

LMSYS hosts a leaderboard that compares the performance of LLMs, including proprietary ones like ChatGPT. It reports three metrics:

  • Chatbot Arena: The answers of two LLMs are presented to users blindly, and the users pick the better one. A ranking score is then calculated for each LLM.
  • MT-bench: GPT-4 judges the answers of the LLM. (This metric favors GPT models.)
  • Massive Multitask Language Understanding (MMLU): The LLM is tested on 57 tasks, including elementary mathematics, US history, computer science, law, and more.
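
To give a feel for how a ranking score can be computed from pairwise votes like those in Chatbot Arena, here is a minimal sketch that assumes an Elo-style rating, the system used in chess. The text above does not specify the formula, so treat this as an illustration only. The starting rating of 1000, the K-factor of 32, and the model names are made-up values.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a, rating_b, a_won, k=32):
    """Return the new ratings of A and B after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    change = k * (score_a - exp_a)
    # What A gains, B loses, so the total rating is conserved.
    return rating_a + change, rating_b - change


# Hypothetical models and votes: (model A, model B, did the user prefer A?)
ratings = {"model_x": 1000.0, "model_y": 1000.0}
votes = [
    ("model_x", "model_y", True),
    ("model_x", "model_y", True),
    ("model_x", "model_y", False),
]
for a, b, a_won in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_won)

print(ratings)  # model_x ends higher because it won two of the three votes
```

An upset win against a higher-rated model moves the ratings more than an expected win, which is why the order of votes and the strength of the opponent both matter.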

Which file format should I use?

If you have an Nvidia GPU card, the GPTQ format gives you the best performance.

If you use a Mac, or Windows or Linux without a GPU, use the GGML format.

How to install Llama models?

See the installation guide for Windows and the installation guide for Mac.

What is the software to use Llama?

Text-generation-webui is a graphical user interface for using Llama models. It is powerful and easy to use. I recommend this software for general users.

If you prefer a text-only experience and are comfortable using a terminal, llama.cpp is a good choice.

Can I use Llama commercially?

No for Llama 1.

Yes for Llama 2.
