Running Llama-7B on Windows CPU or GPU

Things are changing quickly, so this post may well be out of date within days. For now, if you’re looking to run Llama 7B on Windows, here are some quick steps.

Code Repo: https://github.com/treadon/llama-7b-example

Start by running PowerShell. Create a new directory and enter it.

mkdir llama
cd llama

I am assuming you already have Python and pip installed; if not, you can ask ChatGPT for the steps.

Next, create a Python virtual environment. You can do this without one, but I recommend it because, as of now, the setup requires nightly builds of PyTorch (for flash attention) and a development version of transformers installed from GitHub.

python -m venv .venv
.\.venv\Scripts\Activate.ps1

This should create and activate a virtual Python environment. Next we’re going to install everything you need:

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118
pip install git+https://github.com/huggingface/transformers
pip install sentencepiece

This will take a few moments.
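
To confirm the install worked before moving on, a quick check along these lines can help. This is a minimal sketch rather than part of the original steps: the helper name missing_packages is hypothetical, and it only tests that each package can be found, not that the versions are the right ones.

```python
import importlib.util
import sys

# Packages installed by the pip commands above.
REQUIRED = ["torch", "transformers", "sentencepiece"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Inside a venv, sys.prefix points at the venv and differs from sys.base_prefix.
    print("Inside a virtual environment:", sys.prefix != sys.base_prefix)
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, re-run the matching pip command with the virtual environment still active.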

Now create a file called llama.py with the following body:

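import transformers

# Device to run on.
device = "cpu"

# Load the tokenizer and the 7B model weights (downloaded on the first run).
tokenizer = transformers.LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = transformers.LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf").to(device)

# Tokenize the prompt into PyTorch tensors.
batch = tokenizer(
    "The capital of Canada is",
    return_tensors="pt",
    add_special_tokens=False
)

# Move the inputs to the same device as the model.
batch = {k: v.to(device) for k, v in batch.items()}
# Generate up to 100 tokens in total (prompt included) and print the decoded text.
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))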

That’s all there is to it! Run it with the command “python llama.py” and you should be told the capital of Canada. You can modify the above code as you like to get the most out of Llama!
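
As one example of a modification, the generation step could be wrapped in a function so that several prompts can be tried without editing the script each time. This is a minimal sketch, assuming tokenizer and model are the objects loaded in llama.py; the function name complete and its default values are illustrative, not part of transformers.

```python
def complete(prompt, tokenizer, model, device="cpu", max_new_tokens=32):
    """Return the prompt plus the model's continuation as a single string.

    Hypothetical helper: tokenizer and model are expected to be the
    LlamaTokenizer and LlamaForCausalLM objects created in llama.py.
    """
    batch = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    # Keep the inputs on the same device as the model.
    input_ids = batch["input_ids"].to(device)
    # max_new_tokens limits only the generated tokens, not the prompt length.
    generated = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # skip_special_tokens drops markers such as end-of-sequence from the text.
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```

Inside llama.py this could be called as print(complete("The capital of Canada is", tokenizer, model, device)).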

You can replace “cpu” with “cuda” to use your GPU.
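
Rather than editing the string by hand, the script could pick the device when it starts. This is a minimal sketch, assuming the CUDA build of PyTorch from the install step; pick_device is a hypothetical helper name, and it falls back to the CPU when PyTorch or a usable GPU is not found.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a CUDA GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment, so fall back to the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Using device:", pick_device())
```

In llama.py, device = pick_device() would then take the place of device = "cpu".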