{ "@context": "https://schema.org", "@type": "BlogPosting", "headline": "what is vram", "description": "Learn what VRAM is, how GPU memory works, and why it matters for AI and gaming. Covers HBM vs GDDR, memory requirements, and performance in March 2026.", "datePublished": "2026-04-16T18:02:30.937Z", "dateModified": "2026-04-16T18:02:30.937Z", "author": { "@type": "Person", "name": "stone berk" }, "url": "https://www.thundercompute.com/blog/what-is-vram", "wordCount": 1884, "keywords": [ "what is vram" ] }
Everyone focuses on GPU speed, but what VRAM is and how much you have determines whether your model loads at all. A 70B model needs 140GB in FP16, which won't fit on most single GPUs no matter how fast they are. The type of memory matters too because HBM delivers over 3 TB/s
TLDR:
VRAM stands for Video Random Access Memory. It's the dedicated memory that sits directly on your graphics card, separate from the system RAM connected to your CPU.
Think of VRAM as the GPU's private workspace. While your system RAM handles general computing tasks, VRAM stores everything your GPU need
The architecture differs from system RAM in key ways. VRAM sits on the same circuit board as the GPU, connected through a wide, high-speed memory bus. This setup allows for much higher bandwidth compared to system RAM, which connects to the CPU through a different pathway.
Graphics cards use two main types of memory: GDDR
GDDR6, the current standard for consumer GPUs, connects to the graphics chip through a traditional PCB layout. It offers decent bandwidth at lower costs, making it the go-to choice for gaming cards and entry-level workstations. GDDR6X pushes this further with higher transfer rates.
HBM stacks memory dies vertically, sitting much closer to the GPU die. This design delivers superior bandwidth and lower power consumption compared to GDDR variants. Data centers favor HBM2 and HBM3 for AI accelerators like the A100 and H100, where memory bandwidth directly impacts training speed.
The tradeoff comes down to cost versus performance. HBM memory costs more to manufacture but pays off for compute-heavy workloads that constantly move large datasets.
GDDR chips sit in individual modules soldered around the GPU die on the printed circuit board. This modular layout allows manufacturers to adjust memory capacity by adding or removing chips, which is why you see GPUs with varying VRAM configurations using the same core architecture.
GDDR6X introduced PAM4 (4-level pulse amplitude modulation) signaling, which transmits two bits per cycle instead of one. This doubles data rates without requiring higher clock speeds, reducing heat output while pushing bandwidth past 1 TB/s on high-end cards.
HBM stacks multiple memory dies vertically, connected through microscopic through-silicon vias. These stacks sit directly beside the GPU die on a silicon interposer, creating an extremely wide data path.
Where GDDR maxes out around 384-bit bus width, HBM configurations reach 1024-bit and beyond. HBM3 can hit over 3 TB/s bandwidth per stack. HBM costs roughly 3-4x more than equivalent GDDR capacity, which is why data center GPUs pay for the expense while consumer cards stick with GDDR. Cloud providers offering both options provide alternatives to platforms like Runpod for accessing these high-memory GPUs.
| Memory Type | Bus Width | Band | Typical Capacity | Primary Use Case |
|---|---|---|---|---|
| GDDR6 | 384-bit max | Up to 768 GB/s | 8-24 GB | Consumer GPUs, gaming |
| GDDR6X | 384-bit max | Up to 1 TB/s | 12-24 GB | High-end gaming GPUs |
| HBM2 | 1024-bit | Up to 307 GB/s | 8-24 GB per stack | Data center GPUs |
| HBM3 | 1024-5120-bit | Up to 3.35 TB/s | 16-24 GB per stack | AI accelerators, HPC |
The counterintuitive part: an individual HBM chip runs slower than a single GDDR6 chip. But HBM's architecture turns this into an advantage through parallelization.
HBM3 represents the current generation. The NVIDIA H100 80GB pairs its tensor cores with HBM3 memory running a 5120-bit bus at 3.35 TB/s bandwidth. That's roughly triple what you get from GDDR6X cards.
The scalability factor matters too. Stack more HBM dies, and bandwidth scales almost linearly. GDDR6 hits physical limitations much sooner due to PCB routing constraints and signal integrity issues at higher speeds.
System RAM connects to your CPU through the motherboard, typically using a 64-bit or 128-bit memory bus. VRAM sits millimeters from the GPU cores on the same board, using bus widths ranging from 256-bit to 5120-bit.
This placement difference creates a massive bandwidth gap. DDR5 system RAM tops out around 80 GB/s per channel. GPU memory delivers 600 GB/s to 3+ TB/s depending on the configuration. GPUs need this speed because they process thousands of operations at once, while CPUs handle sequential tasks.
When your GPU runs out of VRAM, the system pages data back to system RAM over the PCIe bus. PCIe 4.0 x16 maxes out at roughly 32 GB/s in each direction, far slower than direct VRAM access. This creates stuttering in real-time applications and extends training times for AI models.
AI workloads place heavy demands on VRAM. LLM inference in half-precision requires roughly 2 GB per billion parameters for weights alone. A 13B parameter model needs about 26 GB before KV cache and framework overhead. Full fine-tuning demands far more because optimizer states, gradients, and activations can push requirements to 4-6x the inference baseline.
Quantization reduces memory by lowering precision. INT8 cuts VRAM needs in half versus FP16, while INT4 drops to one-quarter. A 70B model needs 140 GB in FP16 but only 35 GB after INT4 quantization.
| Model Size | FP16 Inference | INT4 Quantized | Recommended GPU VRAM |
|---|---|---|---|
| 7B params | ~14 GB | ~3.5 GB | 16 GB minimum |
| 13B params | ~26 GB | ~6.5 GB | 24-32 GB |
| 70B params | ~140 GB | ~35 GB | 80 GB or multi-GPU |
Training memory requirements go beyond simple model weights. Full fine-tuning needs 12-18 GB per billion parameters, accounting for optimizer states, gradients, and activation checkpointing during backpropagation.
Adam optimizer alone stores two additional copies of model parameters for momentum and variance calculations. That triples memory overhead before gradients and activations enter the picture.
Parameter-efficient fine-tuning methods like LoRA cut these requirements by freezing base model weights and training small adapter layers. This drops training memory closer to inference levels while maintaining quality.
Batch size directly impacts VRAM consumption during training. Larger batches improve GPU utilization but require more memory for activations. Gradient accumulation simulates larger batches by processing smaller chunks sequentially.
An 80GB A100 handles most single-GPU training tasks, while distributed setups across multiple GPUs split memory demands for larger models.
VRAM capacity sets hard limits on which models you can deploy. Loading a 70B parameter model on a 24GB GPU triggers out-of-memory errors before inference begins, no matter how fast the cores run.
This shapes cloud spending directly. An A100 with 40GB costs less per hour than the 80GB version. If your model fits in 40GB, paying for 80GB wastes budget on unused capacity, so budget GPU providers help match your needs to the right configuration.
Batch size scales with available memory. More VRAM means processing more requests at once, boosting throughput and cutting per-inference costs. The KV cache expands with every token processed, so longer context windows consume VRAM quickly. A 4K context window on a 13B model requires far less memory than 32K context on the same architecture.
Insufficient VRAM forces model sharding across multiple GPUs, adding network latency and complicating deployment versus running everything on one high-memory instance.
Memory requirements extend beyond static model weights during inference. The KV cache stores attention keys and values for every token processed, growing linearly with sequence length. A 70B parameter model generating long responses can consume tens of gigabytes just for this cache.
Even if you load a model successfully, the KV cache expands with each generated token. A 4K token conversation requires far less memory than a 16K token context, even with identical model architecture. This growth catches people off guard when deploying LLMs in production.
The 70B FP16 example proves the point: 140GB of weight data exceeds what fits on a single A100 80GB. You need quantization to INT4 or multi-GPU setups to run it. Multi-GPU training splits memory across instances, but adds complexity to your deployment setup.
Picking the right amount of VRAM separates efficient deployments from budget-draining mistakes. Knowing what VRAM your models need before you spin up instances saves you from paying for 80GB cards when 40GB would handle the job. Memory bandwidth shapes training speed and inference throughput more directly than core counts, so start your planning with VRAM specs and work outward from there. Get the memory right and everything else falls into place.\u2014mess it up and you're stuck with hardware that can't run your models or costs twice what you should be paying.
You need around 26 GB of VRAM for inference in half-precision (FP16), but this drops to about 6.5 GB with INT4 quantization. A GPU with 24-32 GB of VRAM handles most use cases comfortably.
GDDR memory uses individual chips arranged around the GPU, while HBM stacks memory dies vertically next to the GPU chip for much higher bandwidth. HBM costs 3-4x more but delivers up to 3+ TB/s bandwidth compared to GDDR6's typical 600-800 GB/s.
When VRAM runs out, your system swaps data with slower system RAM over the PCIe bus, which maxes out around 32 GB/s versus VRAM's 600+ GB/s speeds. This creates a bottleneck that tanks performance in training and inference.
The KV cache stores attention keys and values for every processed token, growing linearly with sequence length. A conversation with 16K tokens consumes far more VRAM than a 4K token exchange, even using the same model architecture.
Not with standard full fine-tuning, which needs 12-18 GB per billion parameters (840-1,260 GB total). You'll need INT4 quantization to fit the model at around 35 GB, or use parameter-efficient methods like LoRA that freeze base weights and train small adapter layers instead.