How to Run LLMs on Your CPU with Llama.cpp: A Step-by-Step Guide

Large language models (LLMs) are becoming increasingly popular, but they can be computationally expensive to run. In this blog post, we will see how to use the llama.cpp library in Python to run LLMs on CPUs with high performance.

Running LLMs is computationally expensive. Advances such as 4-bit and 8-bit model loading on Hugging Face have reduced memory requirements, but they still need a GPU, which limits their use to people with access to specialized hardware. It has long been possible to run these models on CPUs, but the performance was too limited for practical use.

Recent work by Georgi Gerganov has changed this: his llama.cpp library provides high-speed inference for a variety of LLMs running entirely on the CPU.

The original llama.cpp project focuses on running models locally from a shell, which offers little flexibility and makes it hard to leverage the vast range of Python libraries when building applications. Recently, LLM frameworks like LangChain have added support for llama.cpp through the llama-cpp-python package.

In this blog post, we will see how to use llama.cpp from Python via the llama-cpp-python package, which provides Python bindings that make the library easy to use.

We will also use llama-cpp-python to run Vicuna, an open-source model fine-tuned from LLaMA that behaves like ChatGPT.

Set up llama-cpp-python

Setting up the Python bindings is as simple as running the following command:

pip install llama-cpp-python

For more detailed installation instructions, please see the llama-cpp-python documentation: https://github.com/abetlen/llama-cpp-python#installation-from-pypi-recommended.
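
As a quick sanity check, you can confirm that the bindings import correctly. The version attribute below may not exist in every release, hence the fallback:

import llama_cpp

print("llama-cpp-python version:", getattr(llama_cpp, "__version__", "unknown"))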

Using an LLM with llama-cpp-python

Once you have installed the llama-cpp-python package, you can start using it to run LLMs.

You can use any language model with llama.cpp, provided it has been converted to the GGML format. GGML versions of most popular LLMs are already available and can easily be found on the Hugging Face Hub.

An important thing to note is that the original LLMs are quantized when they are converted to the GGML format. Quantization reduces the memory required to run these large models without a significant loss in quality. For example, it lets us load a 7-billion-parameter model that is originally 13 GB in less than 4 GB of RAM.
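
A quick back-of-envelope calculation shows why quantization helps so much. The numbers below are approximate, since quantized GGML formats also store extra per-block scaling factors:

# Approximate memory needed for a 7B-parameter model at different precisions
n_params = 7_000_000_000

fp16_gb = n_params * 2 / 1e9    # 16-bit weights: ~14 GB
int4_gb = n_params * 0.5 / 1e9  # 4-bit weights: ~3.5 GB

print(f"fp16: ~{fp16_gb:.1f} GB, 4-bit: ~{int4_gb:.1f} GB")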

In this article, we use the GGML version of Vicuna-7B, which is available on the Hugging Face Hub.

The model can be downloaded from HuggingFace: https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized.

Downloading the GGML file and Loading the LLM

The following code downloads the required GGML file, in this case the 4-bit quantized (q4_1) Vicuna-7B model, from the Hugging Face Hub. It also checks whether the file is already present before attempting to download it.

import os
import urllib.request


def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")


# Downloading GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"

download_file(ggml_model_path, filename)

The next step is to load the model that you want to use. This can be done using the following code:

from llama_cpp import Llama

llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)

There are two important parameters that should be set when loading the model.

  • n_ctx: This is used to set the maximum context size of the model. The default value is 512 tokens.

The context size is the sum of the number of tokens in the input prompt and the maximum number of tokens the model can generate. A smaller context size lets the model generate text noticeably faster than a larger one, so if your use case does not require very long prompts or generations, reduce the context length for better performance.

The number of tokens in the prompt and the generated text can be estimated with OpenAI's free Tokenizer tool, although it uses a different tokenizer than LLaMA-based models, so the counts are only approximate.

  • n_batch: This is used to set the maximum number of prompt tokens to batch together when generating the text. The default value is 512 tokens.

The n_batch parameter should be set carefully. Lowering n_batch can speed up text generation on multithreaded CPUs, but reducing it too much may degrade generation performance significantly. A short loading sketch using these parameters follows the parameter reference below.

The complete list of parameters can be viewed here: https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama
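
As an illustration, here is a loading sketch that makes these choices explicit and adds a thread count. The n_threads value is an assumption for this example; llama-cpp-python picks a default based on your CPU cores if you omit it. The model's own tokenizer can also be used to count prompt tokens directly:

from llama_cpp import Llama

# Load the model with an explicit context size, prompt batch size and thread count
llm = Llama(
    model_path="ggml-vicuna-7b-1.1-q4_1.bin",
    n_ctx=512,
    n_batch=126,
    n_threads=4,  # illustrative value; tune to the number of physical cores
)

# Count the tokens in a prompt using the model's own tokenizer
prompt = "Who is the CEO of Apple?"
print("Prompt tokens:", len(llm.tokenize(prompt.encode("utf-8"))))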

Generating Text using the LLM

The following code defines a simple wrapper function for generating text with the LLM.

def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text

The call to the llm object accepts several important parameters that control text generation:

  • prompt: The input prompt to the model. This text is tokenized and passed to the model.
  • max_tokens: Sets the maximum number of tokens the model can generate, and therefore controls the length of the generated text. The default value is 128 tokens.
  • temperature: The token sampling temperature. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. The default value in llama-cpp-python is 0.8.
  • top_p: An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered.
  • echo: Boolean parameter to control whether the model returns (echoes) the model prompt at the beginning of the generated text.
  • stop: A list of strings used to stop text generation. If the model generates any of these strings, generation stops at that point. This helps keep the output on track and prevents the model from generating unnecessary text.
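
For illustration, here is how this wrapper might be called with a low temperature and custom stop sequences (the exact output will vary from run to run):

# Ask a short factual question and stop generation at a newline or a new "Q:" turn
answer = generate_text(
    prompt="Q: What is the capital of France?\nA:",
    max_tokens=32,
    temperature=0.1,
    stop=["\n", "Q:"],
)
print(answer)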

The llm object returns a dictionary object of the form:

{
    "id": "xxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",  # text generation id
    "object": "text_completion",  # object name
    "created": 1679561337,  # time stamp
    "model": "./models/7B/ggml-model.bin",  # model path
    "choices": [
        {
            "text": "Q: Name the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune and Pluto.",  # generated text
            "index": 0,
            "logprobs": None,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 14,  # number of tokens in the prompt
        "completion_tokens": 28,  # number of tokens in the generated text
        "total_tokens": 42
    }
}

The generated text can be easily extracted from the dictionary object using output["choices"][0]["text"].
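
The usage field is handy for checking how much of the context window a call consumed. For example, calling llm directly and reading the returned dictionary (the generated text itself will vary):

# Generate a completion and inspect token usage from the returned dictionary
output = llm("Q: Name the planets in the solar system? A:", max_tokens=64, stop=["Q:"])

print(output["choices"][0]["text"].strip())
print("Prompt tokens:", output["usage"]["prompt_tokens"])
print("Completion tokens:", output["usage"]["completion_tokens"])
print("Total tokens:", output["usage"]["total_tokens"])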

Example — Text generation using Vicuna-7B

import os
import urllib.request
from llama_cpp import Llama


def download_file(file_link, filename):
    # Checks if the file already exists before downloading
    if not os.path.isfile(filename):
        urllib.request.urlretrieve(file_link, filename)
        print("File downloaded successfully.")
    else:
        print("File already exists.")


# Downloading GGML model from HuggingFace
ggml_model_path = "https://huggingface.co/CRD716/ggml-vicuna-1.1-quantized/resolve/main/ggml-vicuna-7b-1.1-q4_1.bin"
filename = "ggml-vicuna-7b-1.1-q4_1.bin"

download_file(ggml_model_path, filename)


llm = Llama(model_path="ggml-vicuna-7b-1.1-q4_1.bin", n_ctx=512, n_batch=126)


def generate_text(
    prompt="Who is the CEO of Apple?",
    max_tokens=256,
    temperature=0.1,
    top_p=0.5,
    echo=False,
    stop=["#"],
):
    output = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=temperature,
        top_p=top_p,
        echo=echo,
        stop=stop,
    )
    output_text = output["choices"][0]["text"].strip()
    return output_text


print(
    generate_text(
        "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.",
        max_tokens=356,
    )
)

Generated text:

Hawaii is a state located in the United States of America that is known for its beautiful beaches, lush landscapes, and rich culture. It is made up of six islands: Oahu, Maui, Kauai, Lanai, Molokai, and Hawaii (also known as the Big Island). Each island has its own unique attractions and experiences to offer visitors.
One of the most interesting cultural experiences in Hawaii is visiting a traditional Hawaiian village or ahupuaa. An ahupuaa is a system of land use that was used by ancient Hawaiians to manage their resources sustainably. It consists of a coastal area, a freshwater stream, and the surrounding uplands and forests. Visitors can learn about this traditional way of life at the Polynesian Cultural Center in Oahu or by visiting a traditional Hawaiian village on one of the other islands.
Another must-see attraction in Hawaii is the Pearl Harbor Memorial. This historic site commemorates the attack on Pearl Harbor on December 7, 1941, which led to the United States' entry into World War II. Visitors can see the USS Arizona Memorial, a memorial that sits above the sunken battleship USS Arizona and provides an overview of the attack. They can also visit other museums and exhibits on the site to learn more about this important event in American history.
Hawaii is also known for its beautiful beaches and crystal clear waters, which are perfect for swimming, snorkeling, and sunbathing.

The Jupyter Notebook with the example can be viewed interactively on NbViewer.

The complete code for running the examples can be found on GitHub.

Conclusion

In this blog post, we explored how to use the llama.cpp library in Python with the llama-cpp-python package. These tools enable high-performance CPU-based execution of LLMs.

llama.cpp is updated almost every day. Inference speed keeps improving, and the community regularly adds support for new models. You can also convert your own PyTorch language models into the GGML format: llama.cpp ships with a convert.py script that does this for you.
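
As a rough sketch of that workflow (the exact arguments and tool names depend on the llama.cpp version you have checked out, and ./models/my-model/ is just a placeholder path):

python convert.py ./models/my-model/
./quantize ./models/my-model/ggml-model-f16.bin ./models/my-model/ggml-model-q4_1.bin q4_1

The first step converts the PyTorch weights to a 16-bit GGML file, and the second quantizes it to 4 bits so it can be loaded as shown earlier.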

The llama.cpp library and llama-cpp-python package provide robust solutions for running LLMs efficiently on CPUs. If you’re interested in incorporating LLMs into your applications, I recommend exploring these resources.

Resources:

Llama-cpp-python: https://github.com/abetlen/llama-cpp-python
Example Code — GitHub: https://github.com/awinml/llama-cpp-python-bindings

Originally published at https://awinml.github.io on June 30, 2023.

