A production-ready, user-space GPU Virtual Memory subsystem prototype in C++17/20 with CUDA support. This system provides transparent virtual memory management for GPU-accelerated applications with intelligent page migration, TLB caching, and multiple page replacement policies.
- Virtual Memory Abstraction: 64-bit virtual address space with configurable page size
- Transparent Page Migration: Automatic CPU β GPU migration with intelligent prefetch
- Page Table: Hash-based implementation with residency tracking
- TLB Cache: Hardware-inspired set-associative translation cache with LRU replacement
- Page Replacement Policies: LRU and CLOCK algorithms
- Asynchronous Migration: Worker thread pool for non-blocking page transfers
- Performance Monitoring: Atomic counters for page faults, migrations, bandwidth, latency
- GPU Simulator Mode: Full functionality without requiring physical GPU hardware
- Thread-Safe: std::shared_mutex synchronization for concurrent access
- RAII Memory Management:
DeviceMapped<T>helper for safe resource handling
ββββββββββββββββββββββββββββββββββββββββββ
β Application Layer β
ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββ
β VirtualMemoryManager (Public API) β
ββββββββββββββββββββββββββββββββββββββββββ€
β ββββββββββββ¬βββββββββββ¬ββββββββββββββ β
β βPageTable βPageAlloc β TLB Cache β β
β ββββββββββββ΄βββββββββββ΄ββββββββββββββ β
β ββββββββββββ¬βββββββββββ¬ββββββββββββββ β
β βMigration βPolicies β Perf Ctrs β β
β ββββββββββββ΄βββββββββββ΄ββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββ
β Physical Memory (CPU Pinned + GPU) β
ββββββββββββββββββββββββββββββββββββββββββ
- PageTable: Virtual-to-physical address mapping with residency tracking
- PageAllocator: CPU and GPU memory pool management with bitmap allocation
- TLB: Translation cache reducing page table lookups with set-associative design
- Policies: Pluggable replacement algorithms (LRU, CLOCK)
- MigrationManager: Asynchronous page migration between CPU and GPU
- VirtualMemoryManager: Main public API orchestrating all components
- Device Helpers: CUDA kernels for GPU operations
- Linux x86_64 (Windows/macOS paths also supported)
- C++17 Compiler: GCC β₯7, Clang β₯6, or MSVC β₯2017
- CMake β₯ 3.18
- CUDA Toolkit β₯ 11.0 (optional; simulator mode works without GPU)
- GoogleTest (auto-downloaded via FetchContent)
git clone https://github.com/abhishekprajapatt/gpu-vmm.git
cd gpu-vmm
mkdir build && cd build
# Simulator mode (no GPU required)
cmake -DUSE_GPU_SIMULATOR=ON -DENABLE_TESTS=ON -DENABLE_BENCHMARKS=ON ..
make -j$(nproc)
# With real GPU support
cmake -DUSE_GPU_SIMULATOR=OFF -DENABLE_TESTS=ON ..
make -j$(nproc)./bin/vm_tests # Run unit tests (20+ test cases)
./bin/benchmark_app # Run performance benchmarks
./bin/nbody_gpu_vm 256 20 # N-Body simulation example#include "vm/VirtualMemoryManager.h"
using namespace gpu_vm;
auto& vm = VirtualMemoryManager::instance();
VMConfig config;
config.use_gpu_simulator = true;
vm.initialize(config);
void* vaddr = vm.allocate(256 * 1024 * 1024); // 256 MB
vm.write_to_vaddr(vaddr, data, size);
vm.read_from_vaddr(vaddr, buffer, size);
vm.free(vaddr);{
DeviceMapped<float> arr(1024);
arr[0] = 3.14f;
// Auto-cleanup on scope exit
}vm.prefetch_to_gpu(vaddr);
vm.sync_all_migrations();
run_gpu_kernel(vaddr);auto& perf = vm.get_perf_counters();
std::cout << "Page Faults: " << perf.total_page_faults << "\n";
std::cout << "Migrations: " << perf.cpu_to_gpu_migrations << "\n";
std::cout << "Bandwidth: " << perf.get_bandwidth_gbps() << " GB/s\n";
vm.print_stats();gpu-vmm/
βββ src/
β βββ vm/ # Core VM library
β β βββ Common.h # Shared utilities and types
β β βββ PageTable.h/cpp
β β βββ PageAllocator.h/cpp
β β βββ TLB.h/cpp
β β βββ Policies.h/cpp
β β βββ MigrationManager.h/cpp
β β βββ VirtualMemoryManager.h/cpp
β βββ device/
β β βββ device_helpers.cu # CUDA kernels
β βββ bench/
| βββ benchmark_app.cpp
β
βββ tests/
β βββ vm_tests.cpp
βββ CMakeLists.txt
βββ README.md
βββ .github/workflows/ci.yml
- Page Size: 64 KB (configurable)
- Virtual Space: 256 GB (configurable)
- GPU Memory: 4 GB (configurable)
- TLB Size: 1024 entries, 8-way associative
- Page Fault Overhead: Microsecond-level simulation
- Migration Bandwidth: CPU-GPU transfer simulation with configurable delay
Edit VMConfig in code or via environment to customize:
struct VMConfig {
bool use_gpu_simulator = true; // Enable/disable simulator mode
uint64_t page_size_bytes = 64 * 1024; // Page size
uint64_t virtual_space_gb = 256; // Virtual address space
uint64_t gpu_memory_gb = 4; // GPU memory pool
uint32_t tlb_entries = 1024; // TLB size
uint32_t tlb_ways = 8; // TLB associativity
PageReplacementPolicy policy = PageReplacementPolicy::LRU;
bool enable_async_migrations = true;
uint32_t migration_worker_threads = 4;
};We welcome contributions! Please follow these guidelines:
git clone https://github.com/abhishekprajapatt/gpu-vmm.git
cd gpu-vmmgit checkout -b feature/my-featuremkdir build && cd build
cmake -DUSE_GPU_SIMULATOR=ON -DENABLE_TESTS=ON -DENABLE_BENCHMARKS=ON ..
make -j$(nproc)- Use C++17 modern idioms (smart pointers, RAII)
- Add thread-safe synchronization for shared state
- Include comprehensive comments for complex logic
- Write unit tests for new components
- Maintain code formatting consistency
./bin/vm_tests # Unit tests must pass
./bin/benchmark_app # Verify performancegit add .
git commit -m "feature: Add [description]"
git push origin feature/my-featureProvide:
- Clear description of changes
- Reference to any related issues
- Test results showing no regressions
- Be respectful and inclusive
- Provide constructive feedback
- Focus on the code, not the person
- Help others learn and grow
- Simulator-Only Fault Handling: Uses application-level fault detection, not OS integration
- Fixed Page Size: No mixed-size pages within single allocator instance
- Synchronous Replacement: Eviction not batched with migrations
- Real GPU Simulation: Migration delay is constant, not realistic GPU bandwidth curve
./bin/nbody_gpu_vm 1024 100Runs 1024-particle simulation for 100 timesteps, demonstrating repeated memory access patterns that trigger intelligent migration decisions.
./bin/video_pipeline_sim 100 4Processes 100 video frames in batches of 4, showing how GPU prefetch can hide migration latency.
./bin/benchmark_appGenerates benchmark_results.csv with performance metrics:
- Random page access fault rate
- Sequential access throughput
- Working set overflow behavior
MIT License - See LICENSE file
For questions, issues, or suggestions:
- Open an issue on GitHub
- Review documentation in this README
- Check test cases for usage examples
Working Set Size: 512.00 MB GPU Memory: 4096.00 MB ... Migration Bandwidth: 12.34 GB/s Throughput: 456789 pages/sec Fault Rate: 100.2 faults/sec
Results saved to: benchmark_results.csv
### Run Examples
```bash
# N-Body simulation with 1024 particles, 100 timesteps
./bin/nbody_gpu_vm 1024 100
# Video processing pipeline with 100 frames, batch size 4
./bin/video_pipeline_sim 100 4
#include "vm/VirtualMemoryManager.h"
using namespace uvm_sim;
int main() {
// Initialize VM subsystem
VMConfig config;
config.page_size = 64 * 1024; // 64 KB pages
config.gpu_memory = 4UL * 1024 * 1024 * 1024; // 4 GB
config.replacement_policy = PageReplacementPolicy::LRU;
config.use_gpu_simulator = true;
VirtualMemoryManager& vm = VirtualMemoryManager::instance();
vm.initialize(config);
// Allocate virtual memory
size_t size = 256 * 1024 * 1024; // 256 MB
void* vaddr = vm.allocate(size);
// Write data
uint32_t data = 0xDEADBEEF;
vm.write_to_vaddr(vaddr, &data, sizeof(data));
// Read data
uint32_t result;
vm.read_from_vaddr(vaddr, &result, sizeof(result));
assert(result == data);
// Prefetch page to GPU
vm.prefetch_to_gpu(vaddr);
// Touch page (simulate GPU access)
vm.touch_page(vaddr);
// Free memory
vm.free(vaddr);
// Print statistics
vm.print_stats();
// Cleanup
vm.shutdown();
return 0;
}{
// Allocate array on GPU virtual memory
DeviceMapped<uint32_t> arr(1024);
// Use like a normal array
arr[0] = 42;
arr[512] = 99;
// Access individual elements
uint32_t val = arr[0];
// Automatic cleanup when scope ends
} // arr is freed here// Allocate large working set
void* data = vm.allocate(1024 * 1024 * 1024); // 1 GB
// Process in batches
size_t batch_size = 64 * 1024 * 1024; // 64 MB batches
for (size_t offset = 0; offset < total_size; offset += batch_size) {
// Prefetch batch to GPU
vm.prefetch_to_gpu((uint8_t*)data + offset);
// Process batch (kernel or computation)
process_batch(data, offset, batch_size);
// Sync migrations
vm.sync_all_migrations();
}
vm.free(data);// Access performance data
auto& perf = vm.get_perf_counters();
std::cout << "Page Faults: " << perf.total_page_faults << std::endl;
std::cout << "Migrations: "
<< (perf.cpu_to_gpu_migrations + perf.gpu_to_cpu_migrations)
<< std::endl;
std::cout << "Bytes Migrated: " << perf.total_bytes_migrated << std::endl;
// Print formatted statistics
vm.print_stats();Edit src/vm/Common.h to adjust defaults:
constexpr size_t DEFAULT_PAGE_SIZE = 64 * 1024; // 64 KB
constexpr size_t DEFAULT_VIRTUAL_ADDRESS_SPACE = 256UL * 1024 * 1024 * 1024; // 256 GB
constexpr size_t DEFAULT_GPU_MEMORY = 4UL * 1024 * 1024 * 1024; // 4 GB
constexpr size_t DEFAULT_TLB_SIZE = 1024;.
βββ CMakeLists.txt # Build configuration
βββ README.md # This file
βββ src/
β βββ vm/
β β βββ Common.h # Shared types and utilities
β β βββ PageTable.h/cpp # Virtual page table
β β βββ PageAllocator.h/cpp # Physical memory pools
β β βββ TLB.h/cpp # Translation cache
β β βββ Policies.h/cpp # LRU/CLOCK replacement
β β βββ MigrationManager.h/cpp # Page migration
β β βββ VirtualMemoryManager.h/cpp # Main API
β βββ device/
β β βββ device_helpers.cu # CUDA kernels
β βββ bench/
β βββ benchmark_app.cpp # Microbenchmarks
βββ tests/
β βββ vm_tests.cpp # Unit & integration tests
βββ tools/
β βββ scripts/
β βββ run_bench.sh # Benchmark runner
β βββ visualize.py # CSV visualization
βββ .github/
βββ workflows/
βββ ci.yml # CI/CD pipeline
MIT License - See LICENSE file
For detailed contribution guidelines, see the "Contributing" section above for step-by-step instructions on forking, branching, developing, and submitting pull requests.
If you have CUDA installed and want real GPU support:
cmake -DUSE_GPU_SIMULATOR=OFF -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda ..
make -j$(nproc)- Ensure CUDA toolkit is installed
- Set CUDA path:
cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda .. - Or use simulator mode:
-DUSE_GPU_SIMULATOR=ON
- CMake will automatically download GoogleTest
- Ensure internet connection or pre-download:
cmake --help-property FETCHCONTENT*
- This is expected if working set fits in GPU memory
- Try larger working set or reduce GPU memory allocation in config
- Ensure all
allocate()calls are paired withfree() - Use
DeviceMapped<T>RAII helper to avoid manual cleanup
====================================================================================================
GPU Virtual Memory Subsystem Benchmark Results
====================================================================================================
Benchmark: Random Page Access
----------------------------------------------------
Working Set Size: 512.00 MB
GPU Memory: 4096.00 MB
Total Time: 0.105 seconds
Page Faults: 1456
Migrations: 1456
Total Bytes Migrated: 93.44 MB
Migration Bandwidth: 891.53 GB/s
Throughput: 95,238,095 pages/sec
Fault Rate: 13,866.7 faults/sec
=== Performance Counters ===
Page Faults: 1456
CPU->GPU Migrations: 0
GPU->CPU Migrations: 1456
Total Bytes Migrated: 93440000
Total Migration Time (us): 104680
Migration Bandwidth (GB/s): 0.89
TLB Hits: 10320
TLB Misses: 5680
Total TLB Lookups: 16000
TLB Hit Rate (%): 64.50
Page Evictions: 0
Kernel Launches: 0
Page Prefetches: 0
N-Body Simulation with GPU Virtual Memory
==========================================
Particles: 1024
Steps: 100
Particle Size:128 bytes
Total Memory: 128.00 MB
Initialized particle memory at 0x7f8a4c000000
Initialized 1024 particles
Running simulation...
Step 10 / 100 - KE: 5.432109e+02 (Ξ: -2.34%)
Step 20 / 100 - KE: 5.389234e+02 (Ξ: -3.45%)
...
Step 100 / 100 - KE: 5.123456e+02 (Ξ: -7.89%)
==================================================
Simulation Results
==================================================
Simulation Time: 2.34 seconds
Performance: 451.23 billion interactions/sec
Initial KE: 5.432109e+02
Final KE: 5.123456e+02
Energy Conservation: -5.68%
- Kernel-level integration (for real page fault handling)
- Multi-GPU support
- Adaptive page sizing
- ML-based access pattern prediction
- Page compression
- NUMA optimization
- Persistent kernel execution
- NVIDIA CUDA Unified Memory
- GPU Virtual Memory Papers
- Operating Systems textbooks (Tanenbaum, Silberschatz, Stallings)