How to run Local LLMs on Windows with NVIDIA (llama.cpp + CUDA)

If you’ve ever run llama.cpp on a Mac and then tried to do the same thing on Windows with an NVIDIA GPU, you already know the truth: it’s doable, but it’s not plug-and-play.

This post documents a real, end-to-end setup on Windows 11 + RTX 4070 (8GB VRAM), including the gotchas, missing DLLs, wrong CUDA versions, and, most importantly, which models actually make sense on this hardware.


TL;DR (if you just want it to work)

  • Use a CUDA-enabled llama.cpp build (not winget)

  • Install CUDA 12.x runtime (not 13.x)

  • If PATH fails (it often does): copy CUDA DLLs next to the exe

  • Don’t run 32B models on 8GB VRAM unless you enjoy pain

  • Qwen2.5-14B Q4_K_M fits perfectly and runs fully on GPU




1. Hardware reality check (important)

This guide is based on this setup:

  • Windows 11

  • NVIDIA GPU (RTX 4070 Laptop / NUC class, 8GB VRAM)

  • 32GB system RAM

You cannot run 32B models fully on GPU with 8GB VRAM. They will spill into RAM and performance will suffer. The sweet spot is 14B models.


2. Get the right llama.cpp build (do NOT use winget)

❌ Don't do this:

winget install llama.cpp

This often installs a CPU-only or Vulkan-only build.

✅ Instead:

  • Go to the official llama.cpp GitHub Releases

  • Download the Windows CUDA build
    Example name:

    llama-b7399-bin-win-cuda-12.4-x64.zip

  • Extract it somewhere simple, e.g.:

    C:\Users\YourName\llama
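
If you'd rather script the download and extraction, here's a minimal PowerShell sketch. The release URL and build number mirror the example above and are assumptions; copy the real asset link from the Releases page.

# Download a CUDA build of llama.cpp (URL/build number are examples; use the current link from GitHub Releases)
Invoke-WebRequest -Uri "https://github.com/ggml-org/llama.cpp/releases/download/b7399/llama-b7399-bin-win-cuda-12.4-x64.zip" -OutFile "$env:USERPROFILE\Downloads\llama-cuda.zip"

# Extract into a simple folder, e.g. C:\Users\YourName\llama
Expand-Archive -Path "$env:USERPROFILE\Downloads\llama-cuda.zip" -DestinationPath "$env:USERPROFILE\llama"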

You should see:

  • llama-server.exe

  • llama-cli.exe

  • ggml-cuda.dll


3. Install CUDA (this is where most people break things)

❌ Don’t do this

  • Installing only CUDA 13.x

  • Assuming “latest = best”

If your llama.cpp build is compiled for CUDA 12.x, CUDA 13 will not work.

✅ Do this instead

  • Install CUDA 12.x runtime (12.4 / 12.6 / 12.8 all work)

  • You do not need the full developer toolkit

After installation, you should have these DLLs:

  • cudart64_12.dll

  • cublas64_12.dll

  • cublasLt64_12.dll
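
A quick way to check that Windows can actually find them (a small sketch; the v12.4 folder is an assumption, substitute your installed version):

# Check whether the CUDA runtime DLLs are resolvable via PATH
where.exe cudart64_12.dll cublas64_12.dll cublasLt64_12.dll

# Or check the default install location directly (v12.4 is assumed; adjust to your version)
Test-Path "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin\cudart64_12.dll"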


4. Windows PATH will betray you (here’s the fix)

Even after installing CUDA, Windows often fails to expose the DLLs properly.

The reliable solution (just do this)

Copy these three files:

  • cudart64_12.dll

  • cublas64_12.dll

  • cublasLt64_12.dll

From:

C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x\bin

Into the same folder as:

  • llama-server.exe

  • ggml-cuda.dll

Yes, it’s ugly.
Yes, it works.
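
In PowerShell, the copy looks roughly like this (both paths are assumptions; point them at your actual CUDA version and llama folder):

# Copy the CUDA runtime DLLs next to llama-server.exe (adjust both paths to your setup)
$cudaBin = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.4\bin"
Copy-Item "$cudaBin\cudart64_12.dll", "$cudaBin\cublas64_12.dll", "$cudaBin\cublasLt64_12.dll" -Destination "$env:USERPROFILE\llama"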


5. Verify CUDA is actually working

From the llama folder, run:

.\llama-server.exe --list-devices

✅ You want to see something like:

CUDA0: NVIDIA GeForce RTX 4070

If you only see CPU or Vulkan devices, CUDA is not loaded.


6. Pick the right model (this matters more than flags)

❌ What not to run on 8GB VRAM

  • 32B models (QwQ-32B, Qwen-32B, DeepSeek-32B)

  • You’ll get CPU/RAM spill and poor performance

✅ The sweet spot

Qwen2.5-14B-Instruct (Q4_K_M, GGUF)

Why:

  • Fits fully in ~8.1GB VRAM

  • All layers offload to GPU

  • No RAM thrash

  • Fast and stable

The model comes split into files like:

qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf
qwen2.5-14b-instruct-q4_k_m-00002-of-00003.gguf
qwen2.5-14b-instruct-q4_k_m-00003-of-00003.gguf

Only point llama.cpp at 00001.
It will load the rest automatically.
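
One way to fetch the split files is the Hugging Face CLI. This is a sketch; the repo name and file pattern are assumptions based on the filenames above, so double-check them on the model page.

# Install the CLI, then download the Q4_K_M splits (repo name and pattern are assumed; verify on Hugging Face)
pip install -U "huggingface_hub[cli]"
huggingface-cli download Qwen/Qwen2.5-14B-Instruct-GGUF --include "qwen2.5-14b-instruct-q4_k_m-*.gguf" --local-dir "..\AI Models"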


7. Run the server (recommended command)

.\llama-server.exe ^
  --port 8081 ^
  -m "..\AI Models\qwen2.5-14b-instruct-q4_k_m-00001-of-00003.gguf" ^
  -ngl 999 ^
  -c 4096

What this does:

  • -ngl 999 → offload everything that fits to GPU

  • -c 4096 → safe context size for 8GB VRAM

What success looks like

You’ll see logs like:

offloaded 49/49 layers to GPU
CUDA0 model buffer size = ~8 GB
CPU_Mapped model buffer size = ~400 MB

This means the model is fully GPU-resident.
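
llama-server also exposes an OpenAI-compatible HTTP API, so you can sanity-check generation with a single request. A minimal PowerShell sketch, assuming the port from the command above:

# Send one test chat request to the OpenAI-compatible endpoint and print the response
$body = @'
{ "messages": [ { "role": "user", "content": "Say hello in one sentence." } ], "max_tokens": 64 }
'@
Invoke-RestMethod -Uri "http://localhost:8081/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body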


8. Sanity check performance

While generating text, run:

nvidia-smi -l 1

You should see:

  • VRAM ~8GB used

  • GPU utilization > 0%

  • CPU usage relatively low

If CPU is pegged and GPU is idle, something’s wrong.
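
If you just want the raw numbers rather than the full nvidia-smi table, a compact query (refreshed every second) does the job:

# Compact view: VRAM used and GPU utilization, refreshed every second
nvidia-smi --query-gpu=memory.used,utilization.gpu --format=csv -l 1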


Final thoughts (hard truth)

  • NVIDIA laptops are VRAM-limited, not compute-limited

  • For local LLMs, memory architecture > raw GPU power

  • Macs feel “better” for big models because of unified memory

  • On Windows + NVIDIA, model choice is everything

Once you pick the right model size, llama.cpp with CUDA is fast, stable, and absolutely usable, but getting there requires jumping through a few very Windows-specific hoops.


Hopefully this saves someone else a weekend :)
