Run GPT-OSS models locally in Omniscope for full privacy and AI power


Introduction


This week, OpenAI released their much-awaited open-weights models, gpt-oss-120b and gpt-oss-20b. These are the first open-weights models from OpenAI since GPT-2 in 2019.


“Open weights” means you can download the model file (containing billions of neural network parameters) and use it to run inference (i.e. perform text generation, via a chat UI or via apps like Omniscope which use APIs). You can do this with a 3rd-party inference engine such as Hugging Face Transformers, llama.cpp, vLLM, or LM Studio (entirely offline, without sending any data to a provider), or via a 3rd-party “serverless” provider such as Hugging Face or OpenRouter. Note that open weights isn’t fully open: it doesn’t mean the training data or training process was made open.


This is a big deal: while you’ve been able to download open-weights models from others for some time, this announcement brings a lot in one go:

  • It’s from OpenAI, arguably the industry leader;

  • Released under a permissive Apache 2.0 licence (a good thing);

  • High 128k context length (you can process a high volume of data in one shot);

  • Highly accessible (not requiring multiple high-end GPUs; 20b runs on commodity laptops, 120b on a single H100 GPU);

  • Natively quantized (MXFP4, approx 4 bits per parameter) - very rare amongst open weight models - so performs significantly better (speed, intelligence) for given hardware, and no fiddling around with 3rd party quants;

  • New capabilities regarding prompt roles, tool use, code execution and chain of thought (variable-effort reasoning), in part via… 

  • New ‘harmony’ internal chat format geared towards the Responses API (although it’ll be some time until many providers and inference engines fully support this; Chat Completions will be the main API route for now);

  • Capability roughly equivalent to o4-mini (gpt-oss-120b) and o3-mini (gpt-oss-20b), at least on some benchmarks.


In this article we're going to try them out on a popular inference engine.


Using llama.cpp on Apple Silicon


Llama.cpp is a lightweight package for running LLMs such as gpt-oss locally, using CPU and/or GPU, on Mac, Linux and Windows. The folks at llama.cpp released initial support almost as soon as OpenAI published the models, which is impressive stuff.


In my case I have a 2-year-old MacBook with an M2 Max and 32GB of unified memory, but gpt-oss-20b should work on any M-series Mac with 16GB+ of memory, and on an M4-series Pro or Max you'll get even better performance.


On Mac, it's super easy to get up and running (I'm assuming you've got Homebrew installed already). Since we're relying on bleeding-edge support for gpt-oss, we need to install llama.cpp in a slightly different way than usual. Skip the first line if you've never installed llama.cpp before:


brew uninstall --force llama.cpp        

brew install --HEAD llama.cpp


Now let's start it up. I'm using the suggested command:


llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none
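
Roughly speaking, those flags do the following: -hf downloads the GGUF weights from the ggml-org repo on Hugging Face on first run, -c 0 uses the model's full context length, -fa enables flash attention, --jinja applies the model's built-in chat template, and --reasoning-format none leaves the chain of thought inline in the response rather than splitting it into a separate field.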


By default it starts a web chat app and an OpenAI-compatible API server at http://localhost:8080/. You can open that in your browser and start talking. This is completely offline, so anything you ask has to be baked into the 20-billion-parameter model (a roughly 12GB file). And it's fast.
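
If you'd rather call the API directly than use the chat UI, here's a minimal sketch against the Chat Completions endpoint that llama-server exposes (llama-server serves whichever model it has loaded, so the model name here is just a placeholder):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gpt-oss-20b",
        "messages": [
          {"role": "user", "content": "In one sentence, what is MXFP4 quantization?"}
        ]
      }'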


To use in Omniscope (with an Enterprise licence, and using a forthcoming Autumn/Fall 2025 build):

  • Open Omniscope Admin > Labs and ensure the Report Ninja labs setting is on. Reload the page after enabling.

  • Go to Admin > AI Settings and add a Custom provider. Enter http://localhost:8080/ into the endpoint option. No further customisation should be needed.

  • Expand the Report Ninja and Workflow Execution sections, and pick the gpt-oss model in the "Default model" option for each section.
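
Optionally, before saving, you can sanity-check that the endpoint is reachable and reporting a loaded model by querying the standard models listing (the exact response shape may vary between llama.cpp builds):

curl http://localhost:8080/v1/models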



Save, and go to Files. Now use the Instant Dashboard project template and "drop'n'dash" a CSV dataset.

Alternatively, in an existing workflow, use the Instant Dashboard report preset from the block picker. Then sit back and watch (not for very long - you won't need much popcorn).

It's fast and surprisingly sophisticated for a 20B model needing only 16GB of memory. Very soon you'll see the thinking streaming by, and then the response building up, ending with the AI-generated dashboard appearing:


You can also ask questions and get answers visually.

And if you can't think of one...



Other ways to run and use gpt-oss models


As well as Report Ninja, you can also use gpt-oss (either variant) inside the Omniscope "AI Completion" block. Use AI in your workflow executions to make decisions about your data, perform data classification, and augment data with summaries or generated articles.


We also tested the larger gpt-oss-120b (similar to o4-mini) on a single H100 GPU running on a cloud Linux VM, with similar success and speed. This was again using llama.cpp (this time, built from source) and also using vLLM (via Docker and their gpt-oss specific docker image). 
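
For reference, here's a rough sketch of the kind of commands involved on a CUDA-capable Linux VM; the llama.cpp build flags and the vLLM Docker image/tag are assumptions on our part, so check each project's documentation for the current gpt-oss instructions:

# Build llama.cpp from source with CUDA support, then serve the 120b GGUF weights
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
./build/bin/llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none

# Or run vLLM's OpenAI-compatible server via Docker (gpt-oss may need a specific image tag)
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
    --model openai/gpt-oss-120b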


We also tinkered with other inference engines, LM Studio and Hugging Face Transformers, but without as much success; however, we expect all inference engines and providers to evolve very quickly in this area. And if you prefer not to, or can't, host your own models, you can look at Hugging Face inference providers or OpenRouter, for example, as 3rd-party hosts, though you lose some of the benefit of "your data doesn't leave the building".
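
To illustrate the hosted route, here's roughly what a request to OpenRouter's OpenAI-compatible API looks like (the model identifier is our assumption of how OpenRouter lists gpt-oss-120b, and you'll need your own API key):

curl https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Hello from Omniscope!"}]
      }'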


Please contact us if you're interested in hearing more about any of this.



