Everyone focuses on GPU speed, but how much VRAM you have determines whether your model loads at all. A 70B model needs 140GB in FP16 for its weights alone, which won't fit on most single GPUs no matter how fast they are. The type of memory matters too: HBM delivers over 3 TB/s of bandwidth while GDDR tops out around 1 TB/s, and that gap shows up immediately in training times for large models.

TLDR:

VRAM is dedicated GPU memory storing textures, models, and compute data for instant access.
HBM memory delivers 3+ TB/s bandwidth vs GDDR6's 768 GB/s, critical for AI workloads.
LLM inference needs ~2 GB VRAM per billion parameters; 70B models require 140 GB in FP16.
Running out of VRAM forces slower system RAM swapping, creating performance bottlenecks.
Thunder Compute offers A100 GPUs from $0.99-$7.80/hr with 40GB or 80GB VRAM configurations.

What Is VRAM?

VRAM stands for Video Random Access Memory. It's the dedicated memory that sits directly on your graphics card, separate from the system RAM connected to your CPU.

Think of VRAM as the GPU's private workspace. While your system RAM handles general computing tasks, VRAM stores everything your GPU needs for immediate processing: textures, frame buffers, shader programs, and computational data. This physical proximity to the GPU chip matters because data transfer speeds between the processor and memory can make or break performance, whether you're working with CUDA or ROCm for GPU computing.

The architecture differs from system RAM in key ways. VRAM sits on the same circuit board as the GPU, connected through a wide, high-speed memory bus. This setup allows for much higher bandwidth compared to system RAM, which connects to the CPU through a different pathway.

When you run AI training jobs or render graphics, the GPU reads from and writes to VRAM continuously. Running out of VRAM forces the system to swap data with slower system RAM, creating a bottleneck that tanks performance.

Types of GPU Memory: GDDR vs HBM

Graphics cards use two main types of memory: GDDR (Graphics Double Data Rate) and HBM (High Bandwidth Memory). Each takes a different approach to speed and capacity.

GDDR6, the current standard for consumer GPUs, connects to the graphics chip through a traditional PCB layout. It offers decent bandwidth at lower costs, making it the go-to choice for gaming cards and entry-level workstations. GDDR6X pushes this further with higher transfer rates.

HBM stacks memory dies vertically, sitting much closer to the GPU die. This design delivers superior bandwidth and lower power consumption compared to GDDR variants. Data centers favor HBM2 and HBM3 for AI accelerators like the A100 and H100, where memory bandwidth directly impacts training speed.

The tradeoff comes down to cost versus performance. HBM memory costs more to manufacture but pays off for compute-heavy workloads that constantly move large datasets.

VRAM Types and Specifications

GDDR Memory Architectures

GDDR chips sit in individual modules soldered around the GPU die on the printed circuit board. This modular layout allows manufacturers to adjust memory capacity by adding or removing chips, which is why you see GPUs with varying VRAM configurations using the same core architecture.

GDDR6X introduced PAM4 (4-level pulse amplitude modulation) signaling, which transmits two bits per symbol instead of one. This doubles the data rate at a given symbol rate, which helps push bandwidth to roughly 1 TB/s on high-end cards without a matching jump in signaling frequency.
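To make the two-bits-per-symbol idea concrete, here is a minimal sketch that compares two-level (NRZ) signaling with PAM4. The Gray-coded level mapping and the function names are illustrative assumptions, not the exact encoding used on GDDR6X.

```python
# Illustrative (assumed) Gray-coded mapping from bit pairs to four amplitude levels.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def encode_nrz(bits):
    """Two-level signaling: one symbol per bit."""
    return list(bits)


def encode_pam4(bits):
    """Four-level signaling: one symbol per pair of bits."""
    if len(bits) % 2:
        raise ValueError("PAM4 needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


bits = [1, 0, 0, 1, 1, 1, 0, 0]
print(len(encode_nrz(bits)), "NRZ symbols")
print(len(encode_pam4(bits)), "PAM4 symbols:", encode_pam4(bits))
```

The same eight bits need eight NRZ symbols but only four PAM4 symbols, which is where the doubling at a fixed symbol rate comes from.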

High Bandwidth Memory

HBM stacks multiple memory dies vertically, connected through microscopic through-silicon vias. These stacks sit directly beside the GPU die on a silicon interposer, creating an extremely wide data path.

Where GDDR6 cards max out around a 384-bit bus, each HBM stack has a 1024-bit interface, and a GPU can combine several stacks. A single HBM3 stack delivers up to roughly 819 GB/s, so the five stacks on an H100 add up to a 5120-bit bus and 3.35 TB/s. HBM costs roughly 3-4x more than equivalent GDDR capacity, which is why data center GPUs absorb the expense while consumer cards stick with GDDR. Cloud providers offering both options provide alternatives to platforms like Runpod for accessing these high-memory GPUs.

Memory Type	Bus Width	Peak Bandwidth	Typical Capacity	Primary Use Case
GDDR6	384-bit max per card	Up to 768 GB/s per card	8-24 GB	Consumer GPUs, gaming
GDDR6X	384-bit max per card	Up to ~1 TB/s per card	12-24 GB	High-end gaming GPUs
HBM2	1024-bit per stack	Up to 307 GB/s per stack	8-24 GB per stack	Data center GPUs
HBM3	1024-bit per stack (5120-bit across five stacks on the H100)	Up to ~819 GB/s per stack (3.35 TB/s total on the H100)	16-24 GB per stack	AI accelerators, HPC
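The bandwidth figures in the table follow from a simple relationship: bus width times per-pin data rate, divided by eight bits per byte. The sketch below reproduces a few of them. The per-pin rates (16, 21, and 2.4 Gbit/s) are assumptions chosen to match the table, and the function names are hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gbps):
    """Theoretical peak bandwidth in GB/s: pins * Gbit/s per pin / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8


def implied_pin_rate_gbps(bus_width_bits, bandwidth_gb_s):
    """Invert the formula: per-pin data rate needed to reach a given bandwidth."""
    return bandwidth_gb_s * 8 / bus_width_bits


# (bus width in bits, assumed per-pin data rate in Gbit/s)
configs = {
    "GDDR6, 384-bit @ 16 Gbps": (384, 16.0),
    "GDDR6X, 384-bit @ 21 Gbps": (384, 21.0),
    "HBM2 stack, 1024-bit @ 2.4 Gbps": (1024, 2.4),
}
for name, (width, rate) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(width, rate):.1f} GB/s")

# H100 figures quoted in this article: 5120-bit bus, 3.35 TB/s.
print(f"H100 implied per-pin rate: {implied_pin_rate_gbps(5120, 3350):.2f} Gbps")
```

The H100 calculation works out to a little over 5 Gbit/s per pin, well below the 16-21 Gbit/s of GDDR6 and GDDR6X pins. The bandwidth advantage comes from the width of the bus, not the speed of each pin.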

HBM Memory Architecture and Performance

The counterintuitive part: each HBM pin runs at a lower data rate than a GDDR6 pin. But HBM's architecture turns this into an advantage through parallelization.

HBM splits data across many narrow channels running in parallel. While each channel operates at lower speeds, the combined throughput from dozens of simultaneous paths outperforms GDDR's fewer, faster lanes.

HBM3 represents the current generation. The NVIDIA H100 80GB pairs its tensor cores with HBM3 memory running a 5120-bit bus at 3.35 TB/s bandwidth. That's roughly triple what you get from GDDR6X cards.

The scalability factor matters too. Add more HBM stacks, and bandwidth scales almost linearly, while taller stacks add capacity. GDDR6 hits physical limitations much sooner due to PCB routing constraints and signal integrity issues at higher speeds.

VRAM vs System RAM

System RAM connects to your CPU through the motherboard, typically using a 64-bit or 128-bit memory bus. VRAM sits millimeters from the GPU cores on the same board, using bus widths ranging from 256-bit to 5120-bit.

This placement difference creates a massive bandwidth gap. A DDR5-6400 channel delivers about 51 GB/s, so a dual-channel desktop tops out around 100 GB/s. GPU memory delivers 600 GB/s to 3+ TB/s depending on the configuration. GPUs need this speed because they process thousands of operations at once, while CPUs run far fewer threads in parallel.

When your GPU runs out of VRAM, the system pages data back to system RAM over the PCIe bus. PCIe 4.0 x16 maxes out at roughly 32 GB/s in each direction, far slower than direct VRAM access. This creates stuttering in real-time applications and extends training times for AI models.
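A rough way to see the size of this penalty is to compare how long it takes to move the same amount of data over each path. The sketch below uses the peak bandwidth figures quoted in this article and assumes a hypothetical 10 GB working set. Real transfers will be slower than these theoretical peaks.

```python
def transfer_ms(size_gb, bandwidth_gb_s):
    """Time in milliseconds to move size_gb at a given peak bandwidth."""
    return size_gb / bandwidth_gb_s * 1000


# Peak bandwidths in GB/s, taken from the figures quoted in this article.
links = {
    "PCIe 4.0 x16 (one direction)": 32,
    "GDDR6 VRAM": 768,
    "HBM3 VRAM on H100": 3350,
}

size_gb = 10  # hypothetical working set
for name, bandwidth in links.items():
    print(f"{name}: {transfer_ms(size_gb, bandwidth):.1f} ms")
```

Moving 10 GB takes roughly 0.3 seconds over PCIe 4.0 x16 but only about 3 milliseconds from HBM3, a gap of about two orders of magnitude.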

VRAM Requirements for AI and ML Workloads

AI workloads place heavy demands on VRAM. LLM inference in half-precision requires roughly 2 GB per billion parameters for weights alone. A 13B parameter model needs about 26 GB before KV cache and framework overhead. Full fine-tuning demands far more because optimizer states, gradients, and activations can push requirements to roughly 6-9x the inference baseline.

Quantization reduces memory by lowering precision. INT8 cuts VRAM needs in half versus FP16, while INT4 drops to one-quarter. A 70B model needs 140 GB in FP16 but only 35 GB after INT4 quantization.

Model Size	FP16 Inference (weights)	INT4 Quantized (weights)	Recommended GPU VRAM
7B params	~14 GB	~3.5 GB	16 GB minimum
13B params	~26 GB	~6.5 GB	32 GB or more for FP16, 24 GB if quantized
70B params	~140 GB	~35 GB	80 GB if quantized, multi-GPU for FP16
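The weight figures in the table are just parameter count times bytes per parameter. A minimal sketch, with hypothetical names, that reproduces them (weights only, with no KV cache or framework overhead):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions, precision):
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * BYTES_PER_PARAM[precision]


for size in (7, 13, 70):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(size, precision):g} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{size}B -> {row}")
```

Treat the result as a lower bound and leave headroom for the KV cache, activations, and runtime overhead when picking a GPU.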

Memory Requirements for Model Training

Training memory requirements go beyond simple model weights. Full fine-tuning needs 12-18 GB per billion parameters, accounting for optimizer states, gradients, and the activations saved for backpropagation.

The Adam optimizer alone stores two additional values for every model parameter: a running estimate of the gradient's mean (momentum) and of its variance. That triples the memory tied to the weights before gradients and activations enter the picture.
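One common way to arrive at a per-parameter training cost is to add up what a mixed-precision Adam setup keeps in memory. The breakdown below is an assumption for illustration (FP16 weights and gradients, plus FP32 master weights and two FP32 Adam states); other recipes will differ, and activations are not included.

```python
# Assumed bytes per parameter for mixed-precision training with Adam.
TRAINING_BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}


def training_memory_gb(params_billions):
    """Approximate training memory in GB, excluding activations."""
    return params_billions * sum(TRAINING_BYTES_PER_PARAM.values())


for size in (7, 13, 70):
    print(f"{size}B params: ~{training_memory_gb(size)} GB before activations")
```

Under these assumptions the total is 16 bytes per parameter, or 16 GB per billion parameters, which sits inside the 12-18 GB range quoted above.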

Parameter-efficient fine-tuning methods like LoRA cut these requirements by freezing base model weights and training small adapter layers. This drops training memory closer to inference levels while often preserving most of the quality of full fine-tuning.
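To see why the adapters are so small, consider a single weight matrix of shape d_out x d_in. LoRA leaves it frozen and trains two low-rank factors of shapes d_out x r and r x d_in. The sketch below counts trainable parameters for a hypothetical 4096 x 4096 layer at rank 8; both values are illustrative assumptions.

```python
def full_params(d_out, d_in):
    """Parameters in the original (frozen) weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank = 4096, 4096, 8  # hypothetical layer size and LoRA rank
frozen = full_params(d_out, d_in)
trainable = lora_trainable_params(d_out, d_in, rank)
print(f"frozen: {frozen:,}  trainable: {trainable:,}  ratio: {trainable / frozen:.2%}")
```

Gradients and optimizer states are only needed for the trainable parameters, which is where most of the savings come from. The frozen base weights still have to fit in VRAM.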

Batch size directly impacts VRAM consumption during training. Larger batches improve GPU utilization but require more memory for activations. Gradient accumulation simulates larger batches by processing smaller chunks sequentially.
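The following sketch shows the idea behind gradient accumulation on a toy problem, using plain Python instead of a training framework. The model (y = w * x), the data, and the micro-batch size are all hypothetical. It checks that averaging gradients over equal-sized micro-batches gives the same result as one large batch.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error with respect to w for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

# One large batch: every example contributes to a single gradient.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulation: process micro-batches one at a time and
# add each micro-batch gradient scaled by 1 / number of micro-batches.
micro_batch_size = 2
num_steps = len(xs) // micro_batch_size
accumulated = 0.0
for start in range(0, len(xs), micro_batch_size):
    end = start + micro_batch_size
    accumulated += mse_grad(w, xs[start:end], ys[start:end]) / num_steps

print(full_grad, accumulated)
```

Only one micro-batch of activations is held in memory at a time, so peak VRAM follows the micro-batch size while the weight update matches the larger effective batch.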

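At 12-18 GB per billion parameters, an 80GB A100 can fully fine-tune models of roughly 4-6 billion parameters on its own, and larger ones with parameter-efficient methods. Distributed setups across multiple GPUs split memory demands for bigger models.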

Why VRAM Matters in GPU Cloud Computing

VRAM capacity sets hard limits on which models you can deploy. Loading a 70B parameter model on a 24GB GPU triggers out-of-memory errors before inference begins, no matter how fast the cores run.

This shapes cloud spending directly. An A100 with 40GB costs less per hour than the 80GB version. If your model fits in 40GB, paying for 80GB wastes budget on unused capacity, so it pays to estimate memory needs first and pick a provider that offers the matching configuration.
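A simple sizing helper can capture this decision. The sketch below picks the smallest VRAM configuration that fits an estimated requirement. The list of capacities (24, 40, and 80 GB) comes from the cards mentioned in this article, while the 20% headroom factor and the function name are assumptions for illustration.

```python
def smallest_fitting_gpu(required_gb, available_gb=(24, 40, 80), headroom=1.2):
    """Return the smallest VRAM size (GB) that fits required_gb plus headroom.

    Returns None when no single GPU in the list is large enough.
    """
    needed = required_gb * headroom
    for capacity in sorted(available_gb):
        if capacity >= needed:
            return capacity
    return None


# Weight sizes in GB taken from the estimates earlier in this article.
for label, weights_gb in [("7B FP16", 14), ("13B FP16", 26), ("70B INT4", 35), ("70B FP16", 140)]:
    print(label, "->", smallest_fitting_gpu(weights_gb))
```

A result of None means the model has to be quantized further or split across several GPUs.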

Batch size scales with available memory. More VRAM means processing more requests at once, boosting throughput and cutting per-inference costs. The KV cache expands with every token processed, so longer context windows consume VRAM quickly. A 4K context window on a 13B model requires far less memory than 32K context on the same architecture.

Insufficient VRAM forces model sharding across multiple GPUs, adding inter-GPU communication overhead and complicating deployment versus running everything on one high-memory instance.

KV Cache and Dynamic Memory Usage

Memory requirements extend beyond static model weights during inference. The KV cache stores attention keys and values for every token processed, growing linearly with sequence length. A 70B parameter model generating long responses can consume tens of gigabytes just for this cache.

This means a model that loads successfully can still run out of memory mid-generation. A 4K token conversation requires far less memory than a 16K token context on the same architecture, which catches people off guard when deploying LLMs in production.
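The linear growth can be estimated with a short formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies it to a hypothetical 70B-class configuration (80 layers, 64 attention heads, head dimension 128, FP16 cache). These architecture numbers are assumptions for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size=1, bytes_per_value=2):
    """KV cache size in bytes: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


GIB = 1024 ** 3

for seq_len in (4096, 16384):
    # Standard multi-head attention: one KV head per attention head.
    mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=seq_len)
    # Grouped-query attention: assume 8 KV heads shared across the 64 attention heads.
    gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"{seq_len} tokens: {mha / GIB:.2f} GiB (64 KV heads), {gqa / GIB:.2f} GiB (8 KV heads)")
```

With one KV head per attention head, the cache reaches 40 GiB at 16K tokens for a single sequence. Sharing KV heads cuts that by the same factor as the head reduction, and multiplying by batch size shows how quickly concurrent requests add up.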

The 70B FP16 example shows the weight side of the problem: 140GB of weight data exceeds what fits on a single A100 80GB before any KV cache is allocated. You need INT4 quantization or a multi-GPU setup to run it. Splitting the model across GPUs spreads the memory load, but adds complexity to your deployment.
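As a rough planning aid, the minimum GPU count is the total memory requirement divided by the usable memory per GPU, rounded up. The sketch below assumes that about 90% of each card is usable for model data. That fraction and the function name are hypothetical, and the estimate ignores per-GPU duplication and communication buffers.

```python
import math


def min_gpus_needed(total_gb, per_gpu_gb=80, usable_fraction=0.9):
    """Lower bound on GPU count when memory is split evenly across cards."""
    usable_gb = per_gpu_gb * usable_fraction
    return math.ceil(total_gb / usable_gb)


print("70B FP16 weights (140 GB):", min_gpus_needed(140))
print("70B INT4 weights (35 GB):", min_gpus_needed(35))
```

The result is a floor, not a recommendation: KV cache growth and batch size usually call for more headroom than the weights alone suggest.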

Final Thoughts on VRAM Selection for ML Teams

Picking the right amount of VRAM separates efficient deployments from budget-draining mistakes. Knowing how much VRAM your models need before you spin up instances saves you from paying for 80GB cards when 40GB would handle the job. Memory bandwidth often shapes training speed and inference throughput more directly than core counts, so start your planning with VRAM specs and work outward from there. Get the memory right and everything else falls into place. Get it wrong and you're stuck with hardware that can't run your models or costs twice what you should be paying.

FAQ

How much VRAM do I need to run a 13B parameter model?

You need around 26 GB of VRAM for the weights in half-precision (FP16), but this drops to about 6.5 GB with INT4 quantization. A GPU with 32 GB or more handles FP16 with some room for the KV cache, while a 24 GB card works once the model is quantized.

What's the main difference between GDDR and HBM memory?

GDDR memory uses individual chips arranged around the GPU, while HBM stacks memory dies vertically next to the GPU chip for much higher bandwidth. HBM costs 3-4x more but delivers up to 3+ TB/s bandwidth compared to GDDR6's typical 600-800 GB/s.

Why does my model run slower when VRAM fills up?

When VRAM runs out, your system swaps data with slower system RAM over the PCIe bus, which maxes out around 32 GB/s versus VRAM's 600+ GB/s speeds. This creates a bottleneck that tanks performance in training and inference.

How does KV cache affect VRAM usage during inference?

The KV cache stores attention keys and values for every processed token, growing linearly with sequence length. A conversation with 16K tokens consumes far more VRAM than a 4K token exchange, even using the same model architecture.

Can I train a 70B model on a single 80GB GPU?

Not with standard full fine-tuning, which needs 12-18 GB per billion parameters (840-1,260 GB total). LoRA alone is not enough either, because the frozen FP16 base weights still take 140 GB. The practical single-GPU route combines the two: quantize the base model to INT4 (around 35 GB) and train small LoRA adapter layers on top of the frozen weights.