NVBit Profiler Documentation

The NVBitProfiler integrates NVIDIA’s NVBit (NVIDIA Binary Instrumentation Tool) with MemSysExplorer to provide dynamic instrumentation of memory accesses at the GPU level. It collects memory access frequency and working set statistics during runtime by intercepting and analyzing memory instructions on CUDA binaries.

This lightweight profiler is suitable for evaluating the memory access behavior of GPU kernels without modifying source code.

Important

MemSysExplorer GitHub Repository

Refer to the codebase for the latest update: https://github.com/duca181/MemSysExplorer/tree/apps_dev/apps/profilers/nvbit

To learn more about license terms and third-party attribution, refer to the 6. Licensing and Attribution page.

Overview

NVBit leverages a custom shared object (nvbit.so) injected at runtime using the LD_PRELOAD mechanism. During execution, the instrumentation layer collects various memory-related statistics and writes them to a summary file.

The profiling workflow includes:

  1. Profiling (`profiling`) – Runs the target GPU application with NVBit instrumentation enabled.

  2. Metric Extraction (`extract_metrics`) – Parses the generated global_summary.txt file to extract relevant memory access statistics.

Required Arguments

The profiler defines required arguments for each stage as follows:

@classmethod
def required_profiling_args(cls):
    return ["executable"]

@classmethod
def required_extract_args(cls, action):
    if action == "extract_metrics":
        return ["report_file"]
    else:
        return []

Usage Example

  • Run NVBit profiling on a GPU application:

    python3 main.py --profiler nvbit --action profiling --executable ./your_gpu_app
    
  • Extract metrics from NVBit output:

    python3 main.py --profiler nvbit --action extract_metrics --report_file global_summary.txt
    
  • Do both profiling and extraction sequentially:

    python3 main.py --profiler nvbit --action both --executable ./your_gpu_app
    

Sample Output

The NVBit profiler in MemSysExplorer is based on sample instrumentation code provided by the NVBit SDK. We extend these scripts to extract memory traces and runtime statistics. The output consists of two main components:

  • Raw memory traces in the following format:

    MEMTRACE: CTX 0x%016lx - LAUNCH - Kernel pc 0x%016lx -
    Kernel name %s - grid launch id %ld - grid size %d,%d,%d
    - block size %d,%d,%d - nregs %d - shmem %d - cuda stream
    id %ld
    
  • MemSysExplorer memory statistics, exported in a structured format for downstream analysis.

Instrumentation Behavior

Instrumentation is injected dynamically using LD_PRELOAD, which loads the following shared object during execution:

profilers/nvbit/lib/nvbit.so

This shared object hooks into CUDA kernels and records memory access events. Statistics are written to a text file (default: global_summary.txt), which includes aggregated memory reads, writes, and access sizes.

Environment Notes

  • Ensure the nvbit.so library exists in profilers/nvbit/lib/ and has execute permissions.

  • The LD_PRELOAD environment variable is set automatically by the wrapper before program launch.

  • Only CUDA-enabled applications with kernel launches are supported. Pure CPU binaries are ignored silently.

Additional Information

  • The base tracing scripts are derived from official NVBit examples.

  • For more details and original implementation references, visit the NVBit official site: https://nvbit.github.io/NVBit/

Troubleshooting

  1. Missing `nvbit.so`: Ensure that lib/nvbit.so exists relative to the NVBitProfilers script.

  2. Permission Denied or Silent Exit: Confirm that your application is CUDA-based and built for a supported GPU architecture.

  3. No Output Generated: Make sure your application launches actual CUDA kernels. NVBit will not instrument empty or stub kernels.