
Building Spectre: A CLI Coding Agent

Spectre Logo

Published: 2025-08-18


The Problem

I wanted to build a coding assistant that could run locally on my home server using llama.cpp. Most AI coding tools require sending your code to external services, but I wanted something that could work entirely offline while still being powerful and useful.

I had recently taken the time to set up a homelab, and as part of it I built a reasonably powerful machine for running AI models. I've gotten it to the point where I can run a 30B-parameter coding model from Qwen, and it works really well and is really fast.

I did some research, and while Codex from OpenAI can connect to Ollama, it can't connect to llama.cpp, and it only supports OpenAI's open-source models; that was the holdup for me. Ollama is great and includes a lot out of the box, but from what I found, you're limited to the models they provide. I wanted to run any open-source model, whether that includes OpenAI's open-source ones or not; I wanted to decide.

The Solution

Spectre Demo

Spectre is a CLI coding agent designed specifically for local deployment. It can understand codebases, make intelligent suggestions, and help with development tasks without ever sending your code outside your local network.

It's a mini Claude Code. It has many of the same features, though some don't work as well and need a better implementation. It's able to work on small projects and explain codebases really quickly.

Key Features

Technical Challenges

Building a local AI coding assistant presented several interesting challenges:

Context Management: Efficiently managing code context and maintaining conversation history while keeping memory usage reasonable. My machine can hold a maximum of 12,000 tokens in context, while Claude Code and similar tools can hold hundreds of thousands. That's a difficult constraint to manage: you want a strong prompt, but you also want the code to take up a large part of the context. I found a middle ground by letting the AI mostly do its thing with minimal guardrails.
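To make the constraint concrete, here's a minimal sketch of one way to fit a conversation into a small context window: keep the system prompt, then keep as many of the most recent messages as the budget allows. The function and parameter names are illustrative assumptions, not Spectre's actual code.

```python
# Hypothetical history-trimming sketch for a small (e.g. 12,000-token)
# context window. `count_tokens` is whatever tokenizer-based counter
# the agent uses; here it is passed in so the logic stays generic.

def trim_history(messages, count_tokens, budget=12000, reserve=1500):
    """Keep the system prompt plus the newest messages that fit.

    `reserve` leaves headroom in the budget for the model's reply.
    """
    system, rest = messages[0], messages[1:]
    budget -= count_tokens(system["content"]) + reserve
    kept = []
    # Walk backwards so the most recent messages survive first.
    for msg in reversed(rest):
        cost = count_tokens(msg["content"])
        if cost > budget:
            break
        budget -= cost
        kept.append(msg)
    return [system] + list(reversed(kept))
```

The trade-off the post describes lives in `reserve` and `budget`: a bigger system prompt means fewer tokens left for code and history.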

Integration: Making the tool feel native to existing development workflows. There are some things missing that I'd like to add over the next couple of days, most notably running CLI commands. That would get it to a point where I don't need to define a tool for every way the AI manipulates code; it could make directories and files with a simple command.

Resolving Errors: While building the tool, I found that this model is not as good as cloud offerings. That's just the nature of the game; I don't have hundreds of thousands of dollars to throw at GPUs to run high-performing models. To compensate, I added a way for the AI to roll back changes by storing a backup of any file it manipulates. You can quickly ask it to undo the changes to a certain file and it will find the backup for you.
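The backup-before-edit / undo idea can be sketched in a few lines. The backup directory, function names, and single-backup-per-file behavior here are illustrative assumptions, not Spectre's real scheme.

```python
import shutil
from pathlib import Path

# Hypothetical location for the agent's file backups.
BACKUP_DIR = Path(".spectre_backups")

def backup_file(path):
    """Copy a file aside before the agent edits it."""
    BACKUP_DIR.mkdir(exist_ok=True)
    target = BACKUP_DIR / Path(path).name
    shutil.copy2(path, target)
    return target

def undo_changes(path):
    """Restore the most recent backup of the file, if one exists."""
    backup = BACKUP_DIR / Path(path).name
    if backup.exists():
        shutil.copy2(backup, path)
        return True
    return False
```

The agent calls `backup_file` before any edit; "undo the changes to main.py" then reduces to a single `undo_changes("main.py")` lookup.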

What's Next

I'm continuing to improve Spectre with better tools, enhanced codebase understanding, and more development workflow integrations. The goal is to make it as capable as cloud-based solutions while maintaining complete privacy and local control. While a lot of that will come from new tools and improving existing ones, it will also come from using new models.

The current implementation is really fast on the model I've chosen. Granted, it helps that I have two 24 GB GPUs for it to run on, and that the model is fairly small at just 30B parameters. There were some challenges getting that running: installing llama.cpp from source and getting it to recognize the GPUs was not the most straightforward task in the world (in fact, I still haven't gotten it to build with curl support, so I need the normally installed version of llama.cpp too).

I'm happy to say that it can run on any model you can spin up in llama.cpp, though your mileage may vary in how those models perform. When choosing a model for Spectre, prioritize models that support tool calling. The more parameters the better, and coding-specific models will work better than generalized ones.
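Tool calling matters because llama.cpp's `llama-server` exposes an OpenAI-compatible chat endpoint that accepts a `tools` list, and an agent leans entirely on that. Here's a sketch of what such a request payload looks like; the `read_file` tool and its schema are hypothetical examples, not Spectre's actual tool set.

```python
import json

def build_tool_request(user_message):
    """Build an OpenAI-style chat request with one example tool attached."""
    return {
        "model": "local",  # llama-server serves whichever model it loaded
        "messages": [{"role": "user", "content": user_message}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "read_file",  # hypothetical agent tool
                "description": "Read a file from the project.",
                "parameters": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            },
        }],
    }

# This JSON body would be POSTed to llama-server's /v1/chat/completions.
payload = json.dumps(build_tool_request("What does main.py do?"))
```

A model without tool-calling support will ignore or mangle the `tools` schema, which is why the recommendation above is a hard one for agent use.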

Check out the project on GitHub if you're interested in trying it out or contributing!


I'm currently looking for work, if you have a role that you think I'd be a good fit for feel free to reach out to me on LinkedIn.