Using a DeepSeek R1 distill on a cloud VM

Modified on Wed, 9 Apr at 10:30 AM

This article follows on from Using a local AI model and shows how to set up and connect Omniscope to a larger DeepSeek model running on a cloud computing platform such as AWS, GCP or Azure, or on a local VM or server with sufficient GPU capacity.


We explore two providers (llama.cpp and vLLM) and two precisions: the original weights (in this case 16-bit bf16) and a 4-bit quantisation. A 4-bit quantised model is typically about a quarter of the size of the original bf16 model, and requires about a quarter of the GPU memory for the model weights (although the K/V cache and other overheads also require memory).
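
As a rough illustration of the arithmetic: a 70B-parameter model at bf16 (2 bytes per parameter) needs around 140 GB just for its weights, whereas a 4-bit quantisation (roughly half a byte per parameter, a little more for formats such as Q4_K_M) comes to roughly 40-45 GB. This is why, as described below, the quantised model fits on a single 80 GB GPU while the original needs several.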


Note that the gold standard for AI integration in Omniscope is to use OpenAI's state-of-the-art models such as 4o, 4o-mini and o3-mini. However, we have had good results using open-weight models, as described below.


Prerequisites

  • You will need a Linux server (typically a Linux cloud VM). We used GCP machine types a3-highgpu-1g and a3-highgpu-4g.

  • You should choose a base disk image that has Nvidia CUDA drivers and Nvidia container toolkit preinstalled. Otherwise you'll need to take care of that yourself, following the Nvidia instructions for each.

    In our case, we used GCP base image c2-deeplearning-pytorch-2-4-cu124-v20250327-debian-11, which on first login asks if you want to install the Nvidia driver (you do).
    You may also need to run:
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker


    Later on, you can execute nvidia-smi to look at what's happening with your GPUs.

    Make sure the disk is large enough. We used a good-performance SSD sized at 256 GB.

    One of the machine types we used was a Spot instance with 4x H100 80 GB GPUs (a3-highgpu-4g).

  • The VM will need sufficient GPUs. There are other options (subject to sufficient total VRAM), but we specifically recommend Nvidia H100 80 GB GPUs: 
    • 1x H100 80 GB for the 4-bit quantised 70B model (as provided by a3-highgpu-1g)
    • 4x H100 80 GB for the original bf16 70B model (as provided by a3-highgpu-4g - in fact a 3-GPU VM would suffice, if such a configuration were available).

  • If running Omniscope in a separate project or locally, you'll need to configure your cloud firewall to allow your Omniscope server to reach this VM on port 8000 (see the example commands after this list).
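
To sanity-check this setup, the commands below may help. They are a sketch only: the CUDA image tag, firewall rule name, source range and network tag are examples you'll need to adapt to your own project (and the VM must carry the matching network tag).

To confirm Docker can see the GPUs:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

To open port 8000 to your Omniscope server on GCP (other clouds have equivalents):

gcloud compute firewall-rules create allow-llm-8000 --allow=tcp:8000 --source-ranges=<omniscope-server-ip>/32 --target-tags=llm-server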


Approach

  • We'll be using DeepSeek-R1-Distill-Llama-70B, or a quantised version of it. Its Hugging Face page is at huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B.

    From deepseek-ai, there are several smaller R1 distill variants, with lower GPU memory requirements, such as DeepSeek-R1-Distill-Qwen-14B.

    Note that this is a distilled model, and not R1 itself, which is much larger. This effectively means that R1 itself has been used to train an existing model (in this case Llama 70B from Meta).

    It is also a reasoning model, so has been trained to "think" before responding.

    There are plenty of other open-weight models besides DeepSeek, too, such as the original Meta Llama models, Qwen and Google Gemma, but this article focuses on the DeepSeek Llama 70B distill.

  • We'll be running the inference engine (llama.cpp or vLLM) using Docker. This makes things vastly simpler: the only prerequisites are the CUDA setup above and an OS that can pass GPU control to Docker (e.g. Linux, not macOS).

  • We'll be referencing models on Hugging Face, which means they are downloaded automatically (see the note on caching after this list).

  • We'll be using Omniscope's AI settings in Admin (including, in some cases, in-development features in the 2025.2 dev channel) and the Report Ninja via the Instant Dashboard report, to test it out.
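
A practical note on those automatic downloads: the model files are tens of gigabytes, and by default they are fetched into the container, so a removed or recreated container has to download them again. If you expect to recreate the container, consider mounting a host directory as the model cache by adding a volume flag to the docker run commands below, for example:

-v ~/.cache/huggingface:/root/.cache/huggingface

(The host path is just an example; the in-container path shown is the one the vLLM image typically uses. llama.cpp keeps its cache in a different location, so adjust the path for that image.)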


Option 1: using llama.cpp with quantisation

llama.cpp was originally developed for CPU inference, but it is now also the go-to inference engine for running quantised models on your own GPUs when you don't have sufficient GPU capacity to run the full unquantised model. (You can use llama.cpp for unquantised models too.)


Here's the command to run (all on one line):


docker run -p 8000:8000 --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda -hf unsloth/DeepSeek-R1-Distill-Llama-70B-GGUF --port 8000 --host 0.0.0.0 --n-gpu-layers 999


This command runs the latest CUDA-enabled llama.cpp server on port 8000, passing all GPUs through to Docker and offloading as many layers as possible to the GPUs (--n-gpu-layers 999). It loads the GGUF-format 4-bit quantised variant (Q4_K_M by default) put together by "unsloth" on Hugging Face.


Wait perhaps 20 minutes for everything: for Docker to pull the image layers, and for the model to download, load into the GPUs and initialise. When it's ready, you'll see "starting the main loop" and "all slots are idle" in the logs.


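
Once it's up, you can sanity-check the server from the VM itself before involving Omniscope. The llama.cpp server (and vLLM, below) exposes an OpenAI-compatible API, so the following should list the loaded model (the exact JSON returned varies by version):

curl http://localhost:8000/v1/models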


Option 2a: using vLLM with quantisation

vLLM is more of a GPU-first engine, but it does support quantisation. We've had success with the AWQ format. The command we used was (all on one line):


docker run --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model Valdemardi/DeepSeek-R1-Distill-Qwen-32B-AWQ


As with llama.cpp, this uses the latest vLLM OpenAI-compatible Docker image, and loads an AWQ 4-bit quantised model (in this case a DeepSeek-R1-Distill-Qwen-32B variant) put together by Valdemardi.


We found this could take a while longer than llama.cpp to finish starting up. When it's done, you'll see "Application startup complete" in the logs.


Option 2b: using vLLM with full (unquantised) models

In this case you'll typically need the 4x H100 GPUs, whereas the options above require just a single GPU. The command we used was (all on one line):


docker run --gpus all -p 8000:8000 --ipc=host vllm/vllm-openai:latest --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B --tensor-parallel-size=4


Note the different model (the original from deepseek-ai, shown on Hugging Face as bf16) and the extra --tensor-parallel-size argument to spread it across 4 GPUs.
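
Whichever option you chose, you can also try a quick chat completion against the OpenAI-compatible endpoint before configuring Omniscope. This is a minimal sketch (all on one line); substitute whatever model name your server reports at /v1/models:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B", "messages": [{"role": "user", "content": "Say hello in one short sentence."}]}'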



Configuring Omniscope

Note: this section includes some in-development features in the 2025.2 dev channel, and is subject to change.

We're assuming you noticed the point above about opening up the firewall to allow Omniscope to access the VM, and that you've followed the steps in the previous article, Using a local AI model.


Provider


Start Omniscope, and open Admin > AI Settings.

Add a provider, choose Custom. If you have several, you should give it a suitable name.

Open the new provider's settings.

For the endpoint base URL, enter your VM's address and port, e.g. http://11.22.33.44:8000/




If you're using low-spec GPUs, where performance might be very slow, consider increasing Advanced > Response time to, say, 150.


For llama.cpp you're good to go: it should work with thinking and grammar restriction by default. 

(Grammar restriction is an approach that requires the LLM to emit syntactically correct output in the form Omniscope needs, so that it can be reliably interpreted. Without it, you are relying on the LLM's ability to follow written instructions, and it will occasionally get things wrong, such as referencing fields or settings that don't exist.)

Skip to the Integration settings below. 


If using vLLM, you have a couple of choices:


(a) Thinking without grammar. The model will take longer because it thinks first (you'll see the thinking when it completes), but the output won't be constrained. For the 70B model you may get occasional mistakes, sometimes preventing Omniscope from interpreting the result. Slower. To use this approach, in "Model profile", specify:

System role: supported, 

Additional system messages: not supported, 

Loose json: not supported, 

Structured outputs / grammar: not supported.


(b) Grammar without thinking. The model won't take time to think, so it might not be as sophisticated, but the grammar constraints mean it will generate correctly structured results that Omniscope should always be able to interpret. Faster. To use this approach, in "Model profile", specify:

System role: supported,

Additional system messages: not supported, 

Loose json: not supported, 

Structured outputs / grammar: supported, 

Grammar Restriction: OpenAI Structured Outputs.


(Unfortunately vLLM has a bug that prevents the best of both worlds - grammar and thinking together - from working.)


Integration


Now configure the integration. In AI settings, below the providers, are the integrations. Expand Report Ninja, and select the model from the default model drop-down. llama.cpp and vLLM each report just one model. If you don't see anything, the firewall may not be open, or the server may not have finished loading the model (the check below can help narrow it down).
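
If you need to narrow it down, run the same model-listing request as earlier, but from the machine running Omniscope, substituting your VM's address (the IP below is the same example as above). If it times out, suspect the firewall; if it fails in some other way, the inference server probably hasn't finished loading.

curl http://11.22.33.44:8000/v1/models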



You shouldn't need to configure anything else. Save, and move on.


Testing with Omniscope

Now go to Files, and click Instant Dashboard from the top "Create a project" section.

Drag a CSV file onto the drop area, and click Create.


You'll see the CSV file being imported, a project and report being created, and the report opening with the Report Ninja sidebar showing the beginnings of a conversation.


Now wait a little. Depending on the model, GPU and provider, the response might take 5 to 60 seconds (or longer on lower-end GPUs). It may speed up after the first request.


A dashboard will then appear that the AI has designed for you (based on its understanding of the fields in your data). In the chat panel, you'll see the AI's response, which may or may not include an expandable "Thinking" section, depending on the configuration.




You can now continue the conversation with the AI as desired. In your Docker container's logs, you'll see activity when you ask questions.
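
To follow those logs on the VM, find the container ID and tail it, for example:

docker ps
docker logs -f <container-id>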



