Using a local AI model


Omniscope Evo with an Enterprise licence now lets you configure OpenAI-compatible providers, i.e. LLM inference engines that provide an OpenAI-compatible chat completions endpoint.


There are many models, model variants/quantisations and providers available for running locally on your laptop, on-premises on a company server, or self-hosted in a private cloud. This article focuses on a single simple example: llama.cpp on a MacBook Pro M2 (or similar/later) running qwq-32b-q4_k_m.gguf, a quantised build of Qwen's QwQ-32B.


For other providers or environments, consult the provider's documentation.


Install llama.cpp


The excellent llama.cpp supports CPU-only and GPU-accelerated inference (Apple Silicon is a first-class example) of AI models in the GGUF format, widely available under the "Quantizations" section for a given model on huggingface.co. Quantisation allows much larger models to run on smaller commodity hardware.


On a Mac (such as a MacBook Pro M2 with 32 GB of memory), install llama.cpp following their instructions. We recommend using Homebrew:

brew install llama.cpp
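
Once installed, the llama-server binary should be on your PATH; you can confirm this by printing its help text:

llama-server --help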


Download a model


Visit huggingface.co and find the model you want. For example, I'll use https://huggingface.co/Qwen/QwQ-32B.


On the right, follow the Quantizations section. Typically several people will offer these, and you can usually pick the top one. Bartowski or Unsloth are good bets, but in this case Qwen themselves provide the top choice: https://huggingface.co/Qwen/QwQ-32B-GGUF


Under "Files and versions", pick the .gguf file you want. If unsure, pick the q4_k_m variant. In this case it's:

qwq-32b-q4_k_m.gguf


(meaning a 32-billion-parameter model, quantised to approximately 4.8 bits per weight).


It's a big file: this example is about 20 GB.
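
You can download it through the website, or from the command line. The sketch below assumes you have the Hugging Face CLI installed (pip install -U huggingface_hub) and that the repository and filename are as shown above; the --local-dir places the file in a local models directory, matching the path used in the next step:

huggingface-cli download Qwen/QwQ-32B-GGUF qwq-32b-q4_k_m.gguf --local-dir models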


Run the model


In Terminal, run the following, adapted as required:


llama-server --port 8081 -m models/qwq-32b-q4_k_m.gguf
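
Once the server reports that it is listening, you can sanity-check the OpenAI-compatible chat completions endpoint with curl before touching Omniscope. The "model" value below is only illustrative; llama-server serves whichever model it was started with:

curl http://127.0.0.1:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwq-32b", "messages": [{"role": "user", "content": "Say hello"}]}'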


Configure Omniscope


Following on from the introduction in How to enable AI in Omniscope, go back into AI Settings in Omniscope, and add a Custom provider. You'll need an Enterprise licence; contact support@visokio.com for a trial.


All you will need to configure is the Endpoint base URL. For example, I've specified port 8081 above (the default is 8080), so when llama.cpp runs on the same machine as Omniscope the base URL will be http://127.0.0.1:8081/.
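
If Omniscope runs on a different machine from llama.cpp, substitute that machine's hostname or IP address for 127.0.0.1. As a quick check that the base URL is reachable from the Omniscope host, you can request the model listing endpoint that llama-server exposes:

curl http://127.0.0.1:8081/v1/models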


Now go to the Report Ninja integration settings and pick the model in the Default model drop-down.


Try it out


Now try using Instant Dashboard on the home page to upload a data file and see a dashboard.



Note: large local models like 32B ones are not very "instant" and can take a minute or more to respond. You may need to go to the Custom provider Advanced settings and increase the response timeout.


Performance


If performance is poor, try a smaller model such as DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf. It is less capable (more likely to misunderstand you, or to only partially understand the Omniscope dashboard), but it is much faster. Also consider using a newer, faster machine, such as the latest MacBook Pro.
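
As a rough sketch, a smaller model is started the same way. The -c (context size) and -ngl (GPU layer offload) flags shown here are standard llama.cpp options, but the best values depend on your hardware and the model you choose:

llama-server --port 8081 -m models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -c 8192 -ngl 99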


Caveats


Local models are typically only compatible with Report Ninja. The AI Block and Custom SQL typically require real OpenAI models rather than OpenAI-compatible models.

Smaller models may not work well enough to be useful.

GPU acceleration is essential for acceptable response times.
