git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples
git checkout v11.2 # check out the tag matching the version reported by nvcc -V
cd Samples/deviceQuery
make
./deviceQuery
Output (reformatted here for side-by-side comparison):
| | NVIDIA GeForce MX250 | Tesla V100-PCIE-16GB | NVIDIA GeForce GTX 1650 |
| --- | --- | --- | --- |
| CUDA Driver Version / Runtime Version | 12.2 / 11.2 | 12.2 / 11.5 | 12.2 / 11.2 |
| CUDA Capability Major/Minor version number | 6.1 | 7.0 | 7.5 |
| Total amount of global memory | 2048 MBytes (2147352576 bytes) | 16151 MBytes (16935419905 bytes) | 4096 MBytes (4294639616 bytes) |
| Multiprocessors × CUDA Cores/MP | (3) × (128) = 384 CUDA Cores | (80) × (64) = 5120 CUDA Cores | (14) × (64) = 896 CUDA Cores |
| GPU Max Clock rate | 1038 MHz (1.04 GHz) | 1380 MHz (1.38 GHz) | 1710 MHz (1.71 GHz) |
| Memory Clock rate | 3004 MHz | 877 MHz | 6001 MHz |
| Memory Bus Width | 64-bit | 4096-bit | 128-bit |
| L2 Cache Size | 524288 bytes | 6291456 bytes | 1048576 bytes |
| Maximum Texture Dimension Size (x,y,z) | 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) | 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384) |
| Maximum Layered 1D Texture Size, (num) layers | 1D=(32768), 2048 layers | 1D=(32768), 2048 layers | 1D=(32768), 2048 layers |
| Maximum Layered 2D Texture Size, (num) layers | 2D=(32768, 32768), 2048 layers | 2D=(32768, 32768), 2048 layers | 2D=(32768, 32768), 2048 layers |
| Total amount of constant memory | 65536 bytes | 65536 bytes | 65536 bytes |
| Total amount of shared memory per block | 49152 bytes | 49152 bytes | 49152 bytes |
| Total shared memory per multiprocessor | 98304 bytes | 98304 bytes | 65536 bytes |
| Total number of registers available per block | 65536 | 65536 | 65536 |
| Warp size | 32 | 32 | 32 |
| Maximum number of threads per multiprocessor | 2048 | 2048 | 1024 |
| Maximum number of threads per block | 1024 | 1024 | 1024 |
| Max dimension size of a thread block (x,y,z) | (1024, 1024, 64) | (1024, 1024, 64) | (1024, 1024, 64) |
| Max dimension size of a grid size (x,y,z) | (2147483647, 65535, 65535) | (2147483647, 65535, 65535) | (2147483647, 65535, 65535) |
| Maximum memory pitch | 2147483647 bytes | 2147483647 bytes | 2147483647 bytes |
| Texture alignment | 512 bytes | 512 bytes | 512 bytes |
| Concurrent copy and kernel execution | Yes, 1 copy engine | Yes, 7 copy engines | Yes, 6 copy engines |
| Run time limit on kernels | Yes | No | Yes |
| Integrated GPU sharing Host Memory | No | No | No |
| Support host page-locked memory mapping | Yes | Yes | Yes |
| Alignment requirement for Surfaces | Yes | Yes | Yes |
| Device has ECC support | Disabled | Enabled | Disabled |
| Device supports Unified Addressing (UVA) | Yes | Yes | Yes |
| Device supports Managed Memory | Yes | Yes | Yes |
| Device supports Compute Preemption | Yes | Yes | Yes |
| Supports Cooperative Kernel Launch | Yes | Yes | Yes |
| Supports MultiDevice Co-op Kernel Launch | No | Yes | No |
| Device PCI Domain ID / Bus ID / location ID | 0 / 1 / 0 | 0 / 177 / 0 | 0 / 1 / 0 |
| Compute Mode | Default | Default | Default |

(Compute Mode on all three devices reads: `< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >`.)
For now we focus on the following values (taking the V100 as the example):
```
// int dev = 0;
// cudaSetDevice(dev);
// cudaDeviceProp deviceProp;
// cudaGetDeviceProperties(&deviceProp, dev);
(080) Multiprocessors, (064) CUDA Cores/MP: 5120 CUDA Cores           // deviceProp.multiProcessorCount, _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor)
Maximum number of threads per multiprocessor: 2048                    // deviceProp.maxThreadsPerMultiProcessor
Maximum number of threads per block: 1024                             // deviceProp.maxThreadsPerBlock
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)        // deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) // deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]
```
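These fields can all be read directly with the runtime API. A minimal standalone sketch (compile with `nvcc`; `_ConvertSMVer2Cores` lives in the samples' `helper_cuda.h`, so only the raw SM count is printed here):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int dev = 0;
    cudaSetDevice(dev);
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);

    printf("Multiprocessors: %d\n", deviceProp.multiProcessorCount);
    printf("Max threads per multiprocessor: %d\n", deviceProp.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", deviceProp.maxThreadsPerBlock);
    printf("Max thread block dim (x,y,z): (%d, %d, %d)\n",
           deviceProp.maxThreadsDim[0], deviceProp.maxThreadsDim[1], deviceProp.maxThreadsDim[2]);
    printf("Max grid size (x,y,z): (%d, %d, %d)\n",
           deviceProp.maxGridSize[0], deviceProp.maxGridSize[1], deviceProp.maxGridSize[2]);
    return 0;
}
```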
deviceProp.maxThreadsPerBlock = 1024: from the CUDA programming model's point of view, the product of a block's three dimensions cannot exceed 1024.
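A quick way to see this limit in action (a sketch): a launch whose block dimensions multiply out to more than `maxThreadsPerBlock` fails with `cudaErrorInvalidConfiguration` before the kernel ever runs:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    noop<<<1, dim3(32, 32, 1)>>>();   // 32*32*1 = 1024: a legal configuration
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));

    noop<<<1, dim3(32, 32, 2)>>>();   // 32*32*2 = 2048 > 1024: launch is rejected
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```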
On the relationship between warps and CUDA cores: one blog post contains the misconception that 1 CUDA core corresponds to 1 warp, but according to this Q&A:
> Now your Card has a total Number of 384 cores on 2 SMs with 192 cores each. The CUDA core count represents the total number of single precision floating point or integer thread instructions that can be executed per cycle. Do not consider CUDA cores in any calculation.
When we launch a kernel in this form:
```cuda
int a, b, c;  // threads per block = a*b*c
int x, y, z;  // the whole task needs x*y*z blocks
dim3 threads(a, b, c);
dim3 grid(x, y, z);
kernel_fun<<<grid, threads>>>();
```
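As a concrete instance of this pattern (a sketch; `vecAdd` and `N` are made up for illustration), a 1-D launch usually fixes the block size and derives the grid size from the problem size, rounding up so every element is covered:

```cuda
#include <cuda_runtime.h>

__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: the last block may overshoot n
}

int main() {
    const int N = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));

    dim3 threads(256);           // a*b*c = 256, well under maxThreadsPerBlock = 1024
    dim3 grid((N + 255) / 256);  // x*y*z blocks: ceil(N / 256) to cover all N elements
    vecAdd<<<grid, threads>>>(a, b, c, N);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The round-up `(N + 255) / 256` matters: when `N` is not a multiple of the block size, the final block has surplus threads, which is why the kernel carries the `i < n` guard.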