There is a strange asymmetry in the current AI hardware market. NVIDIA sells the shovels for the gold rush, but the most important number on the box is often not the number of CUDA cores, tensor cores, or advertised TOPS. It is the amount of VRAM. A 12 GB card can be computationally quite capable and still feel obsolete the moment a model, a context window, or a KV cache crosses the memory line. The GPU is not too slow. It is simply too small.
That is why GreenBoost is interesting. Not because it performs magic, and not because it abolishes the hierarchy between VRAM, system RAM, and storage. It does neither. Its value is more subversive: it attacks the artificial cliff where a workload that almost fits becomes a workload that does not run at all. GreenBoost is an independently developed, GPLv2 Linux kernel module paired with a CUDA user-space shim. Public descriptions present it as a transparent memory-extension layer for NVIDIA GPUs, using system DDR memory and even NVMe as overflow tiers for workloads such as local LLM inference. It is explicitly not an NVIDIA product, and it does not replace NVIDIA’s official driver stack; it sits beside it.
The basic trick is brutally practical. GreenBoost intercepts CUDA allocation calls through an LD_PRELOAD shim. Small allocations pass through to the normal CUDA runtime. Large allocations — the sort associated with model weights, KV caches, and other bulky tensors — are routed into an extended memory pool. On Linux, the kernel module allocates pinned DDR pages, reportedly using 2 MB compound pages, exports them as DMA-BUF file descriptors, and lets CUDA import them as external memory. From the application’s point of view, the allocation still looks like CUDA-accessible memory. From the hardware’s point of view, the bytes are not sitting in local VRAM; they are reached across PCIe. Phoronix quotes the project announcement as describing interception of cudaMalloc, cudaMallocAsync, cuMemAllocAsync, cudaFree, and cuMemFree, with a default large-allocation threshold around 256 MB.
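To make the interception idea concrete, here is a minimal sketch of an LD_PRELOAD-style shim in C. It is not GreenBoost's code: the 256 MB threshold mirrors the reported default, but the oversized-allocation path here simply falls back to pinned, GPU-mapped host memory instead of going through a kernel module and DMA-BUF.

```c
/* shim.c -- minimal sketch of a cudaMalloc interceptor, not GreenBoost's code.
 * Oversized requests are backed by pinned, zero-copy host memory so the GPU
 * can still address them; GreenBoost's real path goes through its kernel
 * module and DMA-BUF instead. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stddef.h>
#include <cuda_runtime.h>

#define LARGE_ALLOC_THRESHOLD (256UL << 20)   /* ~256 MB, the reported default */

cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    /* Resolve the real cudaMalloc from the CUDA runtime on first use. */
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (!real_cudaMalloc)
        real_cudaMalloc = (cudaError_t (*)(void **, size_t))
                              dlsym(RTLD_NEXT, "cudaMalloc");

    /* Small allocations stay in VRAM, untouched. */
    if (size < LARGE_ALLOC_THRESHOLD)
        return real_cudaMalloc(devPtr, size);

    /* Large allocations: hand back a pointer the GPU can dereference, but one
     * whose bytes live in pinned system RAM and are reached across PCIe.
     * A real shim would also intercept cudaFree and the async variants. */
    void *host = NULL;
    cudaError_t err = cudaHostAlloc(&host, size, cudaHostAllocMapped);
    if (err != cudaSuccess)
        return err;
    return cudaHostGetDevicePointer(devPtr, host, 0);
}
```

Compiled as a shared object and injected through LD_PRELOAD, even this toy version demonstrates the central trade: the allocation succeeds, but every access to it pays the PCIe toll.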
That distinction matters. GreenBoost does not turn DDR into GDDR. It does not make PCIe behave like on-package memory. PCIe 4.0 x16 tops out at roughly 32 GB/s per direction in theory, and delivers less in practice, while modern GPU VRAM bandwidth is measured in hundreds of GB/s or more. If a model hammers “extended” memory as if it were hot VRAM, performance will collapse. The achievement is not bandwidth parity. The achievement is addressability without rewriting the inference backend. The model that previously failed with CUDA out-of-memory can now at least enter the arena.
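A back-of-envelope illustration makes the gap concrete. The numbers below are mine, not the project's: assume 20 GB of weights overflow and, in a deliberately pessimistic access pattern, must be streamed in full for every token.

```c
/* estimate.c -- pessimistic back-of-envelope, assuming overflow weights are
 * streamed in full once per token; the bandwidth figures are rough effective
 * numbers, not measurements of GreenBoost. */
#include <stdio.h>

int main(void)
{
    const double overflow_gb = 20.0;   /* weights that did not fit in VRAM  */
    const double pcie_gbs    = 25.0;   /* effective PCIe 4.0 x16 throughput */
    const double vram_gbs    = 500.0;  /* typical consumer GDDR bandwidth   */

    printf("streamed over PCIe: %4.0f ms per token\n", 1000.0 * overflow_gb / pcie_gbs);
    printf("read from VRAM:     %4.0f ms per token\n", 1000.0 * overflow_gb / vram_gbs);
    return 0;
}
```

Roughly 800 ms versus 40 ms for the same bytes. Real access patterns are far friendlier than this worst case, which is precisely why placement, not raw capacity, decides whether the approach is usable.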
This makes GreenBoost very different from llama.cpp-style partial offload. llama.cpp can explicitly place only some layers on the GPU and leave others in CPU RAM. That is stable, understandable, and often the correct engineering choice. But it also changes the execution pattern. CPU-resident layers are computed on the CPU, and the system pays both the CPU execution cost and the boundary-crossing cost between CPU and GPU phases. This is not a defect of llama.cpp; it is a deliberate compromise. Its --n-gpu-layers mechanism exists precisely because not every model can fit entirely in VRAM, and users need a controllable way to decide how much of the model is GPU-resident.
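For contrast, this is what explicit partial offload looks like through llama.cpp's C API; the function and field names below match recent versions of llama.h, but check your checkout, since the loading entry points have been renamed over time. Here the application, not a shim, decides how many layers are GPU-resident.

```c
/* offload.c -- explicit partial offload via llama.cpp's C API. The model path
 * and layer count are placeholders; this is the API counterpart of the
 * --n-gpu-layers CLI flag discussed above. */
#include "llama.h"

int main(void)
{
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 24;   /* keep 24 layers in VRAM, run the rest on the CPU */

    struct llama_model *model =
        llama_load_model_from_file("model.gguf", mparams);
    if (!model)
        return 1;

    /* ... create a context and run inference as usual ... */

    llama_free_model(model);
    return 0;
}
```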
GreenBoost aims at a lower layer. It tries to make the memory allocation itself lie more usefully. The application asks for CUDA memory; the shim decides whether the allocation should be real VRAM or extended memory; the kernel module supplies pinned host memory through a path that CUDA can address. This is more transparent than application-level offload. It is also more dangerous, because transparency can hide the performance model. The developer gets a larger apparent memory space, but not a larger local memory bus.
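The CUDA-side half of that path can be sketched with the external-memory API. Assuming a kernel module has already handed the shim a DMA-BUF file descriptor fd covering size bytes of pinned DDR, the runtime can map it to a device-usable pointer roughly like this; again, a sketch of the mechanism, not GreenBoost's implementation.

```c
/* import.c -- sketch of importing a DMA-BUF fd as CUDA external memory.
 * Assumes fd and size come from a kernel module such as GreenBoost's;
 * error handling is reduced to early returns. */
#include <string.h>
#include <cuda_runtime.h>

cudaError_t map_dmabuf(int fd, size_t size, void **devPtr)
{
    cudaExternalMemory_t extMem = NULL;

    struct cudaExternalMemoryHandleDesc handleDesc;
    memset(&handleDesc, 0, sizeof(handleDesc));
    handleDesc.type      = cudaExternalMemoryHandleTypeOpaqueFd;
    handleDesc.handle.fd = fd;
    handleDesc.size      = size;

    cudaError_t err = cudaImportExternalMemory(&extMem, &handleDesc);
    if (err != cudaSuccess)
        return err;

    struct cudaExternalMemoryBufferDesc bufDesc;
    memset(&bufDesc, 0, sizeof(bufDesc));
    bufDesc.offset = 0;
    bufDesc.size   = size;

    /* The returned devPtr is usable by kernels, but the pages stay in DDR. */
    return cudaExternalMemoryGetMappedBuffer(devPtr, extMem, &bufDesc);
}
```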
The most obvious use case is local LLM inference on consumer GPUs. A 12 GB card is an awkward object in 2026: too good to throw away, too small for many of the models people actually want to run. Quantization helps, but at a quality cost. Smaller context windows help, but at a capability cost. CPU offload helps, but at a speed cost. GreenBoost offers a fourth bargain: keep the software mostly unchanged, let the GPU see a larger address space, and accept that overflow memory is slower. Phoronix reports that the project grew out of the desire to run a 31.8 GB model on a 12 GB GeForce RTX 5070, where existing offloading approaches worked but reduced token performance.
The Windows port is also revealing. It cannot merely copy the Linux DMA-BUF architecture, because Windows has different kernel primitives. The public Windows repository describes a KMDF driver plus a CUDA shim DLL using Microsoft Detours. Its README says the Windows path uses pinned 2 MB blocks mapped into user space through MDLs, with CUDA registration at the end; for modern GPUs it prefers CUDA Unified Virtual Memory and prefetching, while the driver-mapped pinned-page path remains a fallback. The port documentation is unusually candid about limitations: fallback access over PCIe is significantly slower, test signing is required, and the implementation is not yet production-hardened with Driver Verifier.
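The UVM-plus-prefetch preference deserves a concrete sketch, because it is a standard CUDA pattern rather than anything Windows-specific: managed allocations can exceed VRAM, and prefetch hints tell the driver which tier should hold the pages next. The snippet below is a generic illustration, not the port's code, and whether a managed allocation may actually be oversubscribed depends on the platform, driver mode, and GPU generation.

```c
/* uvm.c -- generic sketch of oversubscribed managed memory with prefetching.
 * Sizes and device id 0 are placeholders; oversubscription support varies by
 * platform and GPU generation. */
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 32UL << 30;   /* 32 GB: more than a 12 GB card holds */
    float *weights = NULL;

    /* Managed allocation may exceed VRAM; pages migrate on demand. */
    if (cudaMallocManaged((void **)&weights, bytes, cudaMemAttachGlobal) != cudaSuccess)
        return 1;

    /* Hint that these pages should normally live in system RAM ... */
    cudaMemAdvise(weights, bytes, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

    /* ... but pull the hot region into VRAM ahead of the next kernel launch. */
    const size_t hot = 8UL << 30;
    cudaMemPrefetchAsync(weights, hot, 0 /* device */, 0 /* default stream */);

    /* ... launch kernels that read weights ... */

    cudaFree(weights);
    return 0;
}
```

The README's warning about the fallback path applies exactly here: when the prefetch guess is wrong, pages are faulted across PCIe, and the slowdown is not subtle.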
That candor is important because GreenBoost sits in a zone where hype comes easily. “A 12 GB GPU can suddenly address 60+ GB” is true in the same limited sense that a computer can address swap space. Addressable is not the same as fast. But the comparison to swap is also unfair, because GreenBoost is not just blind operating-system paging. It is a CUDA-aware intervention at the allocation boundary, with the potential to treat model-scale allocations differently from ordinary process memory. That is the interesting engineering idea: not pretending that all memory is equal, but exploiting the fact that many AI workloads fail because the software stack treats the VRAM limit as a hard binary wall.
NVIDIA could have delivered a polished version of this idea. It owns the driver stack, the CUDA runtime, the developer ecosystem, and the hardware roadmap. It already understands memory oversubscription, unified virtual addressing, peer-to-peer transfer, pinned memory, and managed memory better than anyone else. Yet the consumer experience remains oddly primitive: if your model does not fit, buy a larger GPU, quantize harder, offload manually, or suffer. GreenBoost is therefore not just a tool. It is a protest in code against product segmentation disguised as physics.
The open question is whether this approach can become more than a clever hack. For inference workloads with predictable access patterns, there may be room for smarter placement: hot tensors in VRAM, colder weights in pinned system RAM, emergency overflow on NVMe, and prefetch decisions informed by the model schedule. For random, bandwidth-hungry access, no shim can repeal the cost of distance. The future version of this idea would need profiling, tensor-aware policy, integration with backends, and brutal honesty in benchmarks. It would need to tell the user not only “the model runs,” but “this is where every millisecond went.”
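Even a toy version of that policy makes the shape of the problem visible: placement becomes a function of tensor temperature and tier capacity. The sketch below is hypothetical; every name in it is invented for illustration and corresponds to nothing in GreenBoost.

```c
/* policy.c -- hypothetical tiering-policy sketch; all names are invented for
 * illustration and do not correspond to GreenBoost's implementation. */
#include <stddef.h>

enum tier { TIER_VRAM, TIER_PINNED_DDR, TIER_NVME };

struct tensor_info {
    size_t bytes;
    double accesses_per_token;   /* estimated by profiling the model schedule */
};

/* Place hot tensors in VRAM while it lasts, warm ones in pinned DDR,
 * and let the rest overflow to NVMe. */
enum tier place(const struct tensor_info *t, size_t *vram_left, size_t *ddr_left)
{
    if (t->accesses_per_token >= 1.0 && *vram_left >= t->bytes) {
        *vram_left -= t->bytes;
        return TIER_VRAM;
    }
    if (*ddr_left >= t->bytes) {
        *ddr_left -= t->bytes;
        return TIER_PINNED_DDR;
    }
    return TIER_NVME;
}
```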
Still, GreenBoost deserves attention because it changes the question. The usual consumer-GPU question is: “How small must the model become to fit the card?” GreenBoost asks: “How much of the model truly needs to live in VRAM at this moment?” That is a better question. It is also the question NVIDIA has little incentive to emphasize, because the simplest official answer is always the same: buy more VRAM.
GreenBoost will not make a 12 GB card behave like a 48 GB workstation GPU. But it may make the 12 GB card less absurdly constrained by a single number printed on the spec sheet. In the local AI world, that is already significant. The difference between unusable and slow-but-working is not cosmetic. It is the difference between a closed door and a door that opens with a warning sign.