Running LLMs Locally
I've discovered what gets me obsessed: a machine that works, but not exactly the way I want it to.
Something has reached a point where I can't ignore it anymore: running a large language model (LLM) locally, without a GPU. I don't know how good the results will be, but the potential upsides are too good not to try.
My ThinkPad has an 8-core CPU and 16GB of RAM, so it's worth a shot. I have two major objectives: a chatbot as a thinking partner, and a coding sidekick similar to GitHub Copilot.
The first tool I landed on was alpaca.cpp, but that was quickly deprecated in favor of llama.cpp. Both are designed specifically to run LLMs on CPUs. Getting hold of llama.cpp is as easy as pulling a Docker image.
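Pulling the prebuilt image is roughly a one-liner; this is the image I ended up running everything through:

docker pull ghcr.io/ggerganov/llama.cpp:full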
Next is grabbing the brains for it to run. Not knowing my way around, I opted for the de facto one: LLaMA from Facebook. Finding the source to download turned out to be tricky; I ultimately ended up downloading it as a torrent. Considering my machine's capacity, I opted for the 7B model, the smallest available, which consumes about 4GB of RAM.
The first thing llama.cpp instructs you to do is convert the model into a format called ggml. The source model was 13GB; the resulting ggml model that llama.cpp runs is about 4GB.
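If I'm reading the llama.cpp README right, the same Docker image can do the conversion and quantization in one pass; something like this, with the paths matching my own layout:

docker run --rm \
-v ~/models/llama:/models \
ghcr.io/ggerganov/llama.cpp:full \
--all-in-one "/models/" 7B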
Now the show begins: I get to run it with prompts. The command is simple enough; it looks something like this:
docker run --rm \
-v ~/models/llama:/models \
--entrypoint '/app/main' \
ghcr.io/ggerganov/llama.cpp:full \
--model /models/7B/ggml-model-q4_0.bin \
--n_predict -1 \
--prompt "How now brown cow?"
With a prompt like that I wouldn't know what to expect. But what came out sounded like someone who couldn't stop babbling. It went on and on, and pretty soon it forgot what it was talking about and went off on another tangent.
This feels nothing like how ChatGPT behaves, let alone matching its knowledge and coherence. I was clearly doing something wrong.
I had to look at the tuning parameters llama.cpp offers; only a handful of them are within my pay grade.
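For the record, the knobs I could actually reason about were the sampling ones. A typical run with tweaks looked something like this; the values are just what I happened to try, not recommendations:

docker run --rm \
-v ~/models/llama:/models \
--entrypoint '/app/main' \
ghcr.io/ggerganov/llama.cpp:full \
--model /models/7B/ggml-model-q4_0.bin \
--ctx_size 512 \
--temp 0.8 \
--top_k 40 \
--top_p 0.9 \
--repeat_penalty 1.1 \
--n_predict 256 \
--prompt "How now brown cow?"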
One of the obvious prompts I tried was "write a Python function for bubble sort". That one it performed successfully.
"An elisp function for Fibonacci sequence" though sounded like a similar task but it ended up babbling nonsense too.
I decided LLaMA might be useful as a coding sidekick. How about as a chatbot? Here it either failed spectacularly or I was making it do something it's not meant to do.
Every form of prompt I tried and every tweak of llama.cpp's parameters resulted in a babbling monologue. It would not even stop and wait for my turn to speak, despite what llama.cpp promised. I know enough humans who do that; I don't need it in a computer.
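The promise in question is llama.cpp's interactive mode, where a reverse prompt is supposed to hand control back to me. Something along these lines, with the "User:" convention being my own guess at a chat transcript:

docker run --rm -it \
-v ~/models/llama:/models \
--entrypoint '/app/main' \
ghcr.io/ggerganov/llama.cpp:full \
--model /models/7B/ggml-model-q4_0.bin \
--interactive \
--reverse-prompt "User:" \
--prompt "Transcript of a conversation between a User and an Assistant. User:"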
So here's the thing: I've gotten it far enough to work, but not far enough to be useful. And now I can't stop thinking about it.
Am I prompting it wrong, or am I not setting up llama.cpp correctly, or both?
Eventually I ended up on Reddit and discovered some things I'd missed. It turns out different LLMs differ quite a bit and are suited to different use cases. LLaMA allegedly is good at text completion, which explains how it wrote the code I wanted. Chatting, however, is the forte of other models. Those will be my upcoming attempts.
There's no plan to use LLMs to write my posts. If anything, I might use them to cut for length.
So why go through the trouble of self-running these when I can simply open ChatGPT in the browser?
For one, I don't have to pay anyone for API use, or even have internet access, for it to work.
But most of all, it's the same reason I use Emacs.
The more mission critical your tool is, the more it needs to be self-owned.
— ykgoon.com (@ykgoon) May 1, 2023
LLMs are not mission-critical for me yet. But I can own one, and then there's room for me to allow it to be.