Skip to content

πŸ–§ GPU-VMM - Production-ready user-space GPU Virtual Memory prototype in C++/CUDA with page migration, TLB caching, and multiple replacement policies.

License

Notifications You must be signed in to change notification settings

abhishekprajapatt/gpu-vmm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

4 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

GPU Virtual Memory Subsystem

A production-ready, user-space GPU Virtual Memory subsystem prototype in C++17/20 with CUDA support. This system provides transparent virtual memory management for GPU-accelerated applications with intelligent page migration, TLB caching, and multiple page replacement policies.

Features

  • Virtual Memory Abstraction: 64-bit virtual address space with configurable page size
  • Transparent Page Migration: Automatic CPU ↔ GPU migration with intelligent prefetch
  • Page Table: Hash-based implementation with residency tracking
  • TLB Cache: Hardware-inspired set-associative translation cache with LRU replacement
  • Page Replacement Policies: LRU and CLOCK algorithms
  • Asynchronous Migration: Worker thread pool for non-blocking page transfers
  • Performance Monitoring: Atomic counters for page faults, migrations, bandwidth, latency
  • GPU Simulator Mode: Full functionality without requiring physical GPU hardware
  • Thread-Safe: std::shared_mutex synchronization for concurrent access
  • RAII Memory Management: DeviceMapped<T> helper for safe resource handling

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    Application Layer                   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    VirtualMemoryManager (Public API)   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚ β”‚PageTable β”‚PageAlloc β”‚  TLB Cache  β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚ β”‚Migration β”‚Policies  β”‚ Perf Ctrs   β”‚  β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
            β”‚
            β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Physical Memory (CPU Pinned + GPU)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Core Components

  1. PageTable: Virtual-to-physical address mapping with residency tracking
  2. PageAllocator: CPU and GPU memory pool management with bitmap allocation
  3. TLB: Translation cache reducing page table lookups with set-associative design
  4. Policies: Pluggable replacement algorithms (LRU, CLOCK)
  5. MigrationManager: Asynchronous page migration between CPU and GPU
  6. VirtualMemoryManager: Main public API orchestrating all components
  7. Device Helpers: CUDA kernels for GPU operations

Build & Setup

Prerequisites

  • Linux x86_64 (Windows/macOS paths also supported)
  • C++17 Compiler: GCC β‰₯7, Clang β‰₯6, or MSVC β‰₯2017
  • CMake β‰₯ 3.18
  • CUDA Toolkit β‰₯ 11.0 (optional; simulator mode works without GPU)
  • GoogleTest (auto-downloaded via FetchContent)

Build Instructions

git clone https://github.com/abhishekprajapatt/gpu-vmm.git
cd gpu-vmm
mkdir build && cd build

# Simulator mode (no GPU required)
cmake -DUSE_GPU_SIMULATOR=ON -DENABLE_TESTS=ON -DENABLE_BENCHMARKS=ON ..
make -j$(nproc)

# With real GPU support
cmake -DUSE_GPU_SIMULATOR=OFF -DENABLE_TESTS=ON ..
make -j$(nproc)

Verification

./bin/vm_tests              # Run unit tests (20+ test cases)
./bin/benchmark_app         # Run performance benchmarks
./bin/nbody_gpu_vm 256 20   # N-Body simulation example

API Usage

Basic Allocation

#include "vm/VirtualMemoryManager.h"
using namespace gpu_vm;

auto& vm = VirtualMemoryManager::instance();
VMConfig config;
config.use_gpu_simulator = true;
vm.initialize(config);

void* vaddr = vm.allocate(256 * 1024 * 1024);  // 256 MB
vm.write_to_vaddr(vaddr, data, size);
vm.read_from_vaddr(vaddr, buffer, size);
vm.free(vaddr);

RAII Helper

{
    DeviceMapped<float> arr(1024);
    arr[0] = 3.14f;
    // Auto-cleanup on scope exit
}

GPU Prefetch

vm.prefetch_to_gpu(vaddr);
vm.sync_all_migrations();
run_gpu_kernel(vaddr);

Performance Monitoring

auto& perf = vm.get_perf_counters();
std::cout << "Page Faults: " << perf.total_page_faults << "\n";
std::cout << "Migrations: " << perf.cpu_to_gpu_migrations << "\n";
std::cout << "Bandwidth: " << perf.get_bandwidth_gbps() << " GB/s\n";
vm.print_stats();

Project Structure

gpu-vmm/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ vm/                  # Core VM library
β”‚   β”‚   β”œβ”€β”€ Common.h         # Shared utilities and types
β”‚   β”‚   β”œβ”€β”€ PageTable.h/cpp
β”‚   β”‚   β”œβ”€β”€ PageAllocator.h/cpp
β”‚   β”‚   β”œβ”€β”€ TLB.h/cpp
β”‚   β”‚   β”œβ”€β”€ Policies.h/cpp
β”‚   β”‚   β”œβ”€β”€ MigrationManager.h/cpp
β”‚   β”‚   └── VirtualMemoryManager.h/cpp
β”‚   β”œβ”€β”€ device/
β”‚   β”‚   └── device_helpers.cu  # CUDA kernels
β”‚   └── bench/
|       └── benchmark_app.cpp
β”‚
β”œβ”€β”€ tests/
β”‚   └── vm_tests.cpp
β”œβ”€β”€ CMakeLists.txt
β”œβ”€β”€ README.md
└── .github/workflows/ci.yml

Performance Characteristics

  • Page Size: 64 KB (configurable)
  • Virtual Space: 256 GB (configurable)
  • GPU Memory: 4 GB (configurable)
  • TLB Size: 1024 entries, 8-way associative
  • Page Fault Overhead: Microsecond-level simulation
  • Migration Bandwidth: CPU-GPU transfer simulation with configurable delay

Configuration

Edit VMConfig in code or via environment to customize:

struct VMConfig {
    bool use_gpu_simulator = true;           // Enable/disable simulator mode
    uint64_t page_size_bytes = 64 * 1024;   // Page size
    uint64_t virtual_space_gb = 256;        // Virtual address space
    uint64_t gpu_memory_gb = 4;             // GPU memory pool
    uint32_t tlb_entries = 1024;            // TLB size
    uint32_t tlb_ways = 8;                  // TLB associativity
    PageReplacementPolicy policy = PageReplacementPolicy::LRU;
    bool enable_async_migrations = true;
    uint32_t migration_worker_threads = 4;
};

Contributing

We welcome contributions! Please follow these guidelines:

1. Fork and Clone

git clone https://github.com/abhishekprajapatt/gpu-vmm.git
cd gpu-vmm

2. Create Feature Branch

git checkout -b feature/my-feature

3. Development Setup

mkdir build && cd build
cmake -DUSE_GPU_SIMULATOR=ON -DENABLE_TESTS=ON -DENABLE_BENCHMARKS=ON ..
make -j$(nproc)

4. Code Guidelines

  • Use C++17 modern idioms (smart pointers, RAII)
  • Add thread-safe synchronization for shared state
  • Include comprehensive comments for complex logic
  • Write unit tests for new components
  • Maintain code formatting consistency

5. Testing

./bin/vm_tests          # Unit tests must pass
./bin/benchmark_app     # Verify performance

6. Commit and Push

git add .
git commit -m "feature: Add [description]"
git push origin feature/my-feature

7. Submit Pull Request

Provide:

  • Clear description of changes
  • Reference to any related issues
  • Test results showing no regressions

Code of Conduct

  • Be respectful and inclusive
  • Provide constructive feedback
  • Focus on the code, not the person
  • Help others learn and grow

Known Limitations

  1. Simulator-Only Fault Handling: Uses application-level fault detection, not OS integration
  2. Fixed Page Size: No mixed-size pages within single allocator instance
  3. Synchronous Replacement: Eviction not batched with migrations
  4. Real GPU Simulation: Migration delay is constant, not realistic GPU bandwidth curve

Examples

N-Body Gravitational Simulation

./bin/nbody_gpu_vm 1024 100

Runs 1024-particle simulation for 100 timesteps, demonstrating repeated memory access patterns that trigger intelligent migration decisions.

Video Processing Pipeline

./bin/video_pipeline_sim 100 4

Processes 100 video frames in batches of 4, showing how GPU prefetch can hide migration latency.

Benchmarks

./bin/benchmark_app

Generates benchmark_results.csv with performance metrics:

  • Random page access fault rate
  • Sequential access throughput
  • Working set overflow behavior

License

MIT License - See LICENSE file

Support

For questions, issues, or suggestions:

  • Open an issue on GitHub
  • Review documentation in this README
  • Check test cases for usage examples
  • Examine example applications in `src/examples/ GPU Virtual Memory Subsystem Benchmark Results

Benchmark: Random Page Access

Working Set Size: 512.00 MB GPU Memory: 4096.00 MB ... Migration Bandwidth: 12.34 GB/s Throughput: 456789 pages/sec Fault Rate: 100.2 faults/sec

Results saved to: benchmark_results.csv


### Run Examples

```bash
# N-Body simulation with 1024 particles, 100 timesteps
./bin/nbody_gpu_vm 1024 100

# Video processing pipeline with 100 frames, batch size 4
./bin/video_pipeline_sim 100 4

API Usage

Basic Example

#include "vm/VirtualMemoryManager.h"
using namespace uvm_sim;

int main() {
    // Initialize VM subsystem
    VMConfig config;
    config.page_size = 64 * 1024;           // 64 KB pages
    config.gpu_memory = 4UL * 1024 * 1024 * 1024;  // 4 GB
    config.replacement_policy = PageReplacementPolicy::LRU;
    config.use_gpu_simulator = true;

    VirtualMemoryManager& vm = VirtualMemoryManager::instance();
    vm.initialize(config);

    // Allocate virtual memory
    size_t size = 256 * 1024 * 1024;  // 256 MB
    void* vaddr = vm.allocate(size);

    // Write data
    uint32_t data = 0xDEADBEEF;
    vm.write_to_vaddr(vaddr, &data, sizeof(data));

    // Read data
    uint32_t result;
    vm.read_from_vaddr(vaddr, &result, sizeof(result));
    assert(result == data);

    // Prefetch page to GPU
    vm.prefetch_to_gpu(vaddr);

    // Touch page (simulate GPU access)
    vm.touch_page(vaddr);

    // Free memory
    vm.free(vaddr);

    // Print statistics
    vm.print_stats();

    // Cleanup
    vm.shutdown();

    return 0;
}

RAII Helper (Recommended)

{
    // Allocate array on GPU virtual memory
    DeviceMapped<uint32_t> arr(1024);

    // Use like a normal array
    arr[0] = 42;
    arr[512] = 99;

    // Access individual elements
    uint32_t val = arr[0];

    // Automatic cleanup when scope ends
}  // arr is freed here

Prefetch and Batch Processing

// Allocate large working set
void* data = vm.allocate(1024 * 1024 * 1024);  // 1 GB

// Process in batches
size_t batch_size = 64 * 1024 * 1024;  // 64 MB batches

for (size_t offset = 0; offset < total_size; offset += batch_size) {
    // Prefetch batch to GPU
    vm.prefetch_to_gpu((uint8_t*)data + offset);

    // Process batch (kernel or computation)
    process_batch(data, offset, batch_size);

    // Sync migrations
    vm.sync_all_migrations();
}

vm.free(data);

Performance Counters

// Access performance data
auto& perf = vm.get_perf_counters();

std::cout << "Page Faults: " << perf.total_page_faults << std::endl;
std::cout << "Migrations: "
          << (perf.cpu_to_gpu_migrations + perf.gpu_to_cpu_migrations)
          << std::endl;
std::cout << "Bytes Migrated: " << perf.total_bytes_migrated << std::endl;

// Print formatted statistics
vm.print_stats();

Configuration

Edit src/vm/Common.h to adjust defaults:

constexpr size_t DEFAULT_PAGE_SIZE = 64 * 1024;  // 64 KB
constexpr size_t DEFAULT_VIRTUAL_ADDRESS_SPACE = 256UL * 1024 * 1024 * 1024;  // 256 GB
constexpr size_t DEFAULT_GPU_MEMORY = 4UL * 1024 * 1024 * 1024;  // 4 GB
constexpr size_t DEFAULT_TLB_SIZE = 1024;

Project Structure

.
β”œβ”€β”€ CMakeLists.txt              # Build configuration
β”œβ”€β”€ README.md                   # This file
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ vm/
β”‚   β”‚   β”œβ”€β”€ Common.h            # Shared types and utilities
β”‚   β”‚   β”œβ”€β”€ PageTable.h/cpp     # Virtual page table
β”‚   β”‚   β”œβ”€β”€ PageAllocator.h/cpp # Physical memory pools
β”‚   β”‚   β”œβ”€β”€ TLB.h/cpp           # Translation cache
β”‚   β”‚   β”œβ”€β”€ Policies.h/cpp      # LRU/CLOCK replacement
β”‚   β”‚   β”œβ”€β”€ MigrationManager.h/cpp  # Page migration
β”‚   β”‚   └── VirtualMemoryManager.h/cpp  # Main API
β”‚   β”œβ”€β”€ device/
β”‚   β”‚   └── device_helpers.cu   # CUDA kernels
β”‚   └── bench/
β”‚       └── benchmark_app.cpp   # Microbenchmarks
β”œβ”€β”€ tests/
β”‚   └── vm_tests.cpp            # Unit & integration tests
β”œβ”€β”€ tools/
β”‚   └── scripts/
β”‚       β”œβ”€β”€ run_bench.sh        # Benchmark runner
β”‚       └── visualize.py        # CSV visualization
└── .github/
    └── workflows/
        └── ci.yml              # CI/CD pipeline

License

MIT License - See LICENSE file

Contributing (Detailed)

For detailed contribution guidelines, see the "Contributing" section above for step-by-step instructions on forking, branching, developing, and submitting pull requests.

Building with GPU Support

If you have CUDA installed and want real GPU support:

cmake -DUSE_GPU_SIMULATOR=OFF -DCUDA_TOOLKIT_ROOT_DIR=/path/to/cuda ..
make -j$(nproc)

Troubleshooting

Build fails: "cuda_runtime.h: No such file"

  • Ensure CUDA toolkit is installed
  • Set CUDA path: cmake -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda ..
  • Or use simulator mode: -DUSE_GPU_SIMULATOR=ON

Tests fail: "Google Test not found"

  • CMake will automatically download GoogleTest
  • Ensure internet connection or pre-download: cmake --help-property FETCHCONTENT*

Benchmark shows 0 faults

  • This is expected if working set fits in GPU memory
  • Try larger working set or reduce GPU memory allocation in config

Memory leaks detected

  • Ensure all allocate() calls are paired with free()
  • Use DeviceMapped<T> RAII helper to avoid manual cleanup

Example Output

Benchmark Results

====================================================================================================
GPU Virtual Memory Subsystem Benchmark Results
====================================================================================================

Benchmark: Random Page Access
----------------------------------------------------
Working Set Size:        512.00 MB
GPU Memory:              4096.00 MB
Total Time:              0.105 seconds
Page Faults:             1456
Migrations:              1456
Total Bytes Migrated:    93.44 MB
Migration Bandwidth:     891.53 GB/s
Throughput:              95,238,095 pages/sec
Fault Rate:              13,866.7 faults/sec

=== Performance Counters ===
Page Faults:                  1456
CPU->GPU Migrations:         0
GPU->CPU Migrations:         1456
Total Bytes Migrated:        93440000
Total Migration Time (us):   104680
Migration Bandwidth (GB/s):  0.89
TLB Hits:                    10320
TLB Misses:                  5680
Total TLB Lookups:           16000
TLB Hit Rate (%):            64.50
Page Evictions:              0
Kernel Launches:             0
Page Prefetches:             0

N-Body Example Output

N-Body Simulation with GPU Virtual Memory
==========================================
Particles:    1024
Steps:        100
Particle Size:128 bytes
Total Memory: 128.00 MB

Initialized particle memory at 0x7f8a4c000000
Initialized 1024 particles

Running simulation...

Step    10 / 100 - KE: 5.432109e+02 (Ξ”: -2.34%)
Step    20 / 100 - KE: 5.389234e+02 (Ξ”: -3.45%)
...
Step   100 / 100 - KE: 5.123456e+02 (Ξ”: -7.89%)

==================================================
Simulation Results
==================================================
Simulation Time:     2.34 seconds
Performance:         451.23 billion interactions/sec
Initial KE:          5.432109e+02
Final KE:            5.123456e+02
Energy Conservation: -5.68%

Future Roadmap

  • Kernel-level integration (for real page fault handling)
  • Multi-GPU support
  • Adaptive page sizing
  • ML-based access pattern prediction
  • Page compression
  • NUMA optimization
  • Persistent kernel execution

References


About

πŸ–§ GPU-VMM - Production-ready user-space GPU Virtual Memory prototype in C++/CUDA with page migration, TLB caching, and multiple replacement policies.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published