2026 April 21 CUDA, CUTE, Vector Addition, GPU Programming

Vector Addition: From Naive CUDA to H100-Optimized

Vector addition is the “hello world” of GPU programming: C[i] = A[i] + B[i] for every element. It’s trivially parallel — every element is independent — which makes it the perfect playground to learn how memory access patterns determine GPU performance.

Why GPU? A CPU processes elements one at a time (or a few at a time with SIMD). A GPU has thousands of lightweight cores that can process thousands of elements concurrently. For vector addition on 25 million elements, a CPU iterates through the array; a GPU launches many blocks of threads and schedules them across SMs in waves until the whole vector is covered.

Background: How a GPU Executes Work

Before diving into code, let’s establish the execution model:

Thread: The smallest unit of execution. Each thread runs the same kernel code but on different data.
Block: A group of threads (typically 128–1024) that can share fast on-chip memory and synchronize with each other.
Grid: The collection of all blocks launched for a kernel. The grid covers the entire problem.
Warp: A hardware scheduling unit of 32 threads within a block. All 32 threads in a warp execute the same instruction simultaneously (SIMT).

When you launch a kernel, the GPU maps your grid of blocks onto its Streaming Multiprocessors (SMs). Each SM runs one or more blocks concurrently.

Here’s how these levels relate to each other:

Grid (entire problem)
├── Block 0  (up to 1024 threads)
│   ├── Warp 0  (threads 0–31)
│   ├── Warp 1  (threads 32–63)
│   └── ...
├── Block 1
│   ├── Warp 0
│   └── ...
└── ...

You specify the grid and block sizes at launch time. The hardware handles warps automatically — you never create warps explicitly.
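
To make the hierarchy concrete, here is a small pure-Python sketch (it runs on the CPU, not the GPU) that decomposes a thread's index within a block into a warp number and a lane. The helper names `warp_and_lane` and `warps_per_block` are made up for this illustration; the warp size of 32 comes from the definitions above.

```python
WARP_SIZE = 32  # threads per warp, as described above


def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Return (warp index within the block, lane within that warp)."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps formed from one block (ceiling division)."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


print(warp_and_lane(0))      # first thread of warp 0
print(warp_and_lane(63))     # last thread of warp 1
print(warps_per_block(256))  # warps in a 256-thread block
```

A block size that is not a multiple of 32 still rounds up to whole warps, which is one reason block sizes are usually chosen as multiples of 32.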

Understanding the H100 Memory System

The NVIDIA H100 (Hopper architecture) has a memory hierarchy designed for massive throughput:

Level	Size	Bandwidth	Latency
HBM3 (Global Memory)	80 GB	~3.35 TB/s	~400 cycles
L2 Cache	50 MB	~12 TB/s	~200 cycles
L1 / Shared Memory	256 KB per SM	~33 TB/s (aggregate)	~30 cycles
Registers	256 KB per SM	Highest (operands feed the ALUs directly)	~1 cycle

Vector addition is memory-bound — the computation (one add per element) is trivial. The GPU spends most of its time moving data to and from HBM3. Our job is to maximize memory bandwidth utilization.

What does “memory-bound” mean? Every kernel is bottlenecked by either computation (math instructions) or memory (loading/storing data). Vector addition does 1 add per 12 bytes transferred (read A, read B, write C — each 4 bytes for float32). The GPU can do far more math than this in the time it takes to fetch the data, so the memory bus is the bottleneck. Optimizing a memory-bound kernel means getting data in and out of HBM as fast as possible.
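
The arithmetic behind that claim can be sketched in a few lines of plain Python. This is a back-of-the-envelope model, not a measurement: it uses the approximate 3.35 TB/s figure from the table above, and the function names are invented for this example.

```python
BYTES_PER_FLOAT32 = 4
PEAK_HBM_BANDWIDTH = 3.35e12  # bytes/s, approximate figure from the table above


def bytes_moved(n: int) -> int:
    """Bytes transferred for C = A + B: read A, read B, write C."""
    return 3 * BYTES_PER_FLOAT32 * n


def arithmetic_intensity() -> float:
    """Floating-point operations per byte moved (1 add per 12 bytes)."""
    return 1 / (3 * BYTES_PER_FLOAT32)


def min_runtime_seconds(n: int, bandwidth: float = PEAK_HBM_BANDWIDTH) -> float:
    """Lower bound on kernel time if all traffic ran at the given bandwidth."""
    return bytes_moved(n) / bandwidth


n = 25_000_000
print(f"{bytes_moved(n) / 1e6:.0f} MB moved")
print(f"{arithmetic_intensity():.4f} FLOP per byte")
print(f"{min_runtime_seconds(n) * 1e6:.1f} microseconds at peak bandwidth")
```

For 25 million elements this works out to 300 MB of traffic and a floor of roughly 90 microseconds, no matter how fast the adds themselves run.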

Part 1: The Naive Kernel

A CUDA kernel is a function that runs on the GPU. The __global__ keyword tells the compiler this function is called from the CPU but executes on the GPU. Every thread runs the same kernel code, but each thread has a different threadIdx and blockIdx — built-in variables that let each thread figure out which element it should work on.

__global__ void naive_add_kernel(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

void solve(const float* A, const float* B, float* C, int N) {
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;
    naive_add_kernel<<<blocks, threads_per_block>>>(A, B, C, N);
}

The solve function is the host-side launcher — it runs on the CPU. The <<<blocks, threads_per_block>>> syntax is CUDA’s kernel launch syntax: it tells the GPU how many blocks to create and how many threads per block. The formula (N + threads_per_block - 1) / threads_per_block is ceiling division — it ensures we launch enough threads to cover all N elements, even when N isn’t perfectly divisible by the block size.

How It Works

Index calculation: Each thread computes its unique global index idx = blockIdx.x * blockDim.x + threadIdx.x. Think of it like this: if you have 256 threads per block, then block 0 has threads 0–255, block 1 has threads 256–511, and so on. blockIdx.x * blockDim.x gives the starting index of the block, and threadIdx.x is the thread’s position within that block.
Bounds check: if (idx < N) prevents out-of-bounds access when N isn’t a multiple of the block size. For example, if N = 1000 and we launch 4 blocks of 256 threads (1024 threads total), the last 24 threads have idx >= 1000 and must do nothing.
Coalesced access: Adjacent threads access adjacent memory addresses, so warps read contiguous memory — the hardware coalesces these into efficient 128-byte transactions.
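
The index calculation and bounds check can be replayed on the CPU to check the numbers in the example above. The sketch below is a sequential Python model of the kernel, not GPU code; `launch_config` and `simulate_naive_add` are hypothetical helper names.

```python
def launch_config(n: int, threads_per_block: int = 256) -> int:
    """Ceiling division: number of blocks needed to cover n elements."""
    return (n + threads_per_block - 1) // threads_per_block


def simulate_naive_add(a, b, threads_per_block=256):
    """CPU model of naive_add_kernel: one loop iteration per simulated thread."""
    n = len(a)
    c = [0.0] * n
    idle_threads = 0
    for block_idx in range(launch_config(n, threads_per_block)):
        for thread_idx in range(threads_per_block):
            idx = block_idx * threads_per_block + thread_idx
            if idx < n:  # the bounds check from the kernel
                c[idx] = a[idx] + b[idx]
            else:
                idle_threads += 1  # thread launched but has nothing to do
    return c, idle_threads


a = [float(i) for i in range(1000)]
b = [2.0 * i for i in range(1000)]
c, idle = simulate_naive_add(a, b)
print(launch_config(1000), idle)
```

With N = 1000 the model launches 4 blocks and leaves 24 threads idle, matching the bounds-check example.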

Why This Is Slow

The naive kernel often achieves only a fraction of peak HBM3 bandwidth on H100-class GPUs. One important limiter is instruction overhead: each thread issues scalar load/store instructions for a single element. The memory controller coalesces warp-level accesses, but the scheduler still has to process many individual memory instructions.

To understand why, consider what each thread does: it issues 2 load instructions (one for A[idx], one for B[idx]), 1 add, and 1 store. That’s 3 memory instructions per element, each moving only 4 bytes. The GPU’s instruction pipeline has limited throughput — it can only dispatch so many instructions per cycle. When every instruction moves so little data, the pipeline becomes the bottleneck, not the memory bus itself.

Part 2: Vectorized with `float4`

Key Insight: Vectorized Memory Access

The GPU memory bus is wide — each memory transaction fetches 128 bytes (a full cache line). In the naive kernel, each thread requests only 4 bytes, but the hardware still fetches the full 128-byte cache line. The data isn’t wasted (neighboring threads use adjacent parts of the same cache line), but each thread pays the cost of issuing a separate instruction for just 4 bytes.

Vectorization means loading more data per instruction. Instead of each thread loading one 4-byte float, each thread can load 4 floats (16 bytes) in a single 128-bit instruction, as long as the pointer is 16-byte aligned:

Naive:      Thread 0 loads A[0]     → 4 bytes,  1 instruction
Vectorized: Thread 0 loads A[0:4]   → 16 bytes, 1 instruction (LDG.E.128)

Benefits:

4x fewer load instructions → less instruction scheduler pressure
Better bus utilization → each instruction carries more useful data
More in-flight bytes → better latency hiding per Little’s Law
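
To put a number on the first benefit, the following Python sketch counts memory instructions for the scalar and float4 approaches. It is a simple counting model (two loads and one store per unit of work), and the function names are invented for this example.

```python
def memory_instructions_scalar(n: int) -> int:
    """Two loads and one store per element."""
    return 3 * n


def memory_instructions_float4(n: int) -> int:
    """Two loads and one store per float4 chunk, plus scalar work for the tail."""
    vec_n, tail = divmod(n, 4)
    return 3 * vec_n + 3 * tail


n = 25_000_000
scalar = memory_instructions_scalar(n)
vectorized = memory_instructions_float4(n)
print(scalar, vectorized, scalar / vectorized)
```

For N = 25 million the count drops from 75 million to 18.75 million memory instructions, while the number of bytes moved stays the same.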

The Vectorized Kernel

The idea is simple: treat the float array as an array of float4 (a struct of 4 floats), so each thread processes 4 elements at once. Since N might not be divisible by 4, we handle the leftover “tail” elements separately with scalar loads.

__global__ void vectorized_add_kernel(const float* A, const float* B, float* C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int vec_N = N / 4;

    if (idx < vec_N) {
        const float4* A4 = reinterpret_cast<const float4*>(A);
        const float4* B4 = reinterpret_cast<const float4*>(B);
        float4* C4 = reinterpret_cast<float4*>(C);

        float4 a = A4[idx];
        float4 b = B4[idx];

        C4[idx] = make_float4(
            a.x + b.x,
            a.y + b.y,
            a.z + b.z,
            a.w + b.w
        );
    }

    // Handle remaining elements (when N is not divisible by 4)
    int tail_start = vec_N * 4;
    int tail_idx = tail_start + threadIdx.x;
    if (threadIdx.x < (N - tail_start) && blockIdx.x == 0) {
        C[tail_idx] = A[tail_idx] + B[tail_idx];
    }
}

void solve(const float* A, const float* B, float* C, int N) {
    int threads_per_block = 256;
    int vec_N = N / 4;
    int blocks = (vec_N + threads_per_block - 1) / threads_per_block;
    blocks = blocks > 0 ? blocks : 1;  // still launch one block for N < 4 tail elements
    vectorized_add_kernel<<<blocks, threads_per_block>>>(A, B, C, N);
}

How `float4` Enables Vectorization

float4 is a CUDA built-in type: a struct with four members (x, y, z, w), each a 32-bit float, 16 bytes total. The type has 16-byte alignment, and pointers returned by cudaMalloc are aligned to at least 256 bytes, which is more than enough here. If you pass an offset pointer such as A + 1, the reinterpret_cast<float4*> version is no longer safe because the address is not 16-byte aligned. When alignment is valid, the compiler can emit a single 128-bit load/store instruction that moves all 16 bytes in one shot, the same data movement that would otherwise take 4 scalar float loads/stores.

The reinterpret_cast<const float4*>(A) tells the compiler to treat the float array as an array of float4. This doesn’t copy or rearrange any data — it just changes how the pointer arithmetic works. A4[idx] now accesses 4 contiguous floats starting at A[idx * 4]. The cast is correct only when the base pointer is 16-byte aligned and the vectorized region contains complete groups of 4 floats.
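
The two conditions above (alignment, and which scalars a float4 index covers) are easy to model in plain Python. This is only an address-arithmetic sketch: the base address is a made-up value standing in for a cudaMalloc result, and the helper names are hypothetical.

```python
FLOAT_BYTES = 4
FLOAT4_BYTES = 16


def is_float4_aligned(address: int) -> bool:
    """True if a float4 load/store at this byte address is legal."""
    return address % FLOAT4_BYTES == 0


def float4_element_range(idx: int) -> range:
    """Scalar indices covered by A4[idx]."""
    return range(idx * 4, idx * 4 + 4)


base = 0x7F0000000000  # hypothetical device address, chosen to be 16-byte aligned

print(is_float4_aligned(base))                    # the base pointer A
print(is_float4_aligned(base + 1 * FLOAT_BYTES))  # the offset pointer A + 1
print(list(float4_element_range(3)))              # scalars behind A4[3]
```

A host-side check along these lines (testing the pointer value modulo 16 before choosing the vectorized path) is one way to fall back to the scalar kernel for misaligned inputs.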

What Is Memory Coalescing?

GPUs don’t fetch individual bytes from memory. Instead, the memory controller works in transactions — fixed-size chunks of 128 bytes (a cache line). When 32 threads in a warp access addresses within the same 128-byte aligned region, the hardware serves all requests in a single memory transaction:

Good (coalesced):
  Thread 0 → address 0x1000    (4 bytes)
  Thread 1 → address 0x1004
  Thread 2 → address 0x1008
  ...
  → 1 transaction (128 bytes, fully utilized)

Bad (strided):
  Thread 0 → address 0x1000
  Thread 1 → address 0x2000
  → 32 separate transactions (mostly wasted bytes)

With float4, each thread accesses 16 contiguous bytes, and adjacent threads access adjacent float4 elements. This means a warp issues 32 × 16 = 512 bytes of loads, served in 4 perfectly coalesced 128-byte transactions — with zero wasted bytes.
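
The transaction counts above can be reproduced with a simplified Python model that counts how many distinct 128-byte lines one warp-wide access touches. Real hardware works at finer granularity and has more rules, so treat this as an illustration of the idea only; `transactions` is a made-up helper.

```python
LINE_BYTES = 128
WARP_SIZE = 32


def transactions(base: int, stride_bytes: int, access_bytes: int) -> int:
    """Count distinct 128-byte lines touched by one warp-wide access.

    Lane i accesses access_bytes bytes starting at base + i * stride_bytes.
    """
    lines = set()
    for lane in range(WARP_SIZE):
        start = base + lane * stride_bytes
        end = start + access_bytes - 1
        lines.update(range(start // LINE_BYTES, end // LINE_BYTES + 1))
    return len(lines)


print(transactions(0x1000, 4, 4))       # coalesced scalar float loads
print(transactions(0x1000, 16, 16))     # coalesced float4 loads
print(transactions(0x1000, 0x1000, 4))  # strided: one line per thread
```

Under this model the three cases touch 1, 4, and 32 lines respectively, matching the coalesced, float4, and strided examples above.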

Part 3: Grid-Stride Loop (Production Pattern)

In Parts 1 and 2, we launched one thread per element (Part 1) or per float4 chunk (Part 2). For N = 25 million, the float4 version launches ~6.25 million threads, or ~24,400 blocks of 256; the scalar version launches four times as many. This works, but it’s not ideal: launching that many blocks has overhead, and an H100 SXM has only 132 SMs, so most blocks sit in a queue waiting for an SM to become available.

The grid-stride loop pattern takes a different approach: launch a fixed, modest number of threads, and have each thread process multiple elements by walking through the array in strides.

__global__ void gridstride_add_kernel(const float* A, const float* B, float* C, int N) {
    int vec_N = N / 4;
    const float4* A4 = reinterpret_cast<const float4*>(A);
    const float4* B4 = reinterpret_cast<const float4*>(B);
    float4* C4 = reinterpret_cast<float4*>(C);

    int stride = gridDim.x * blockDim.x;

    // Main vectorized loop
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < vec_N; idx += stride) {
        float4 a = A4[idx];
        float4 b = B4[idx];
        C4[idx] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }

    // Handle tail elements
    for (int idx = vec_N * 4 + blockIdx.x * blockDim.x + threadIdx.x; idx < N; idx += stride) {
        C[idx] = A[idx] + B[idx];
    }
}

void solve(const float* A, const float* B, float* C, int N) {
    int threads_per_block = 256;
    int vec_N = N / 4;
    int blocks = (vec_N + threads_per_block - 1) / threads_per_block;
    blocks = blocks < 256 ? blocks : 256;
    blocks = blocks > 0 ? blocks : 1;  // required when N < 4
    gridstride_add_kernel<<<blocks, threads_per_block>>>(A, B, C, N);
}

Why Grid-Stride?

The key line is for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < vec_N; idx += stride). Each thread starts at its global index and jumps forward by stride (the total number of threads in the grid) on each iteration. This guarantees every element is covered exactly once, and adjacent threads always access adjacent memory — preserving coalescing.

With N = 25,000,000 and float4, there are 6,250,000 vector elements to process. A grid of 256 blocks × 256 threads = 65,536 threads, so each thread handles ~95 float4 chunks. Here’s a visual of how threads walk through the array:

Thread 0:  processes elements 0, 65536, 131072, ...
Thread 1:  processes elements 1, 65537, 131073, ...
Thread 2:  processes elements 2, 65538, 131074, ...
...
Thread 65535: processes elements 65535, 131071, ...

Benefits:

Fixed grid size: the number of blocks doesn’t scale with N — you choose the grid size for optimal occupancy
Preserved coalescing: neighboring threads still access neighboring addresses in every loop iteration
Works for any N: no need for a 1:1 thread-to-element mapping
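
The "covered exactly once" claim can be checked with a sequential Python model of the grid-stride kernel. This is a CPU sketch that mirrors the two loops, with a deliberately tiny grid so the tail path is exercised; `simulate_grid_stride` is a hypothetical name.

```python
def simulate_grid_stride(n: int, blocks: int, threads_per_block: int) -> list[int]:
    """Return how many times each scalar element is written by the simulated grid."""
    writes = [0] * n
    vec_n = n // 4
    stride = blocks * threads_per_block  # gridDim.x * blockDim.x
    for tid in range(stride):            # tid = blockIdx.x * blockDim.x + threadIdx.x
        # Main vectorized loop: each float4 chunk covers 4 scalar elements.
        for idx in range(tid, vec_n, stride):
            for k in range(4):
                writes[4 * idx + k] += 1
        # Scalar tail loop for the last n % 4 elements.
        for idx in range(vec_n * 4 + tid, n, stride):
            writes[idx] += 1
    return writes


writes = simulate_grid_stride(n=1003, blocks=2, threads_per_block=8)
print(all(w == 1 for w in writes))
```

The same model also confirms the per-thread workload quoted above: with 6,250,000 chunks and 65,536 threads, each thread handles 95 or 96 chunks.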

Capping Blocks at 256

An H100 SXM has 132 SMs. With 256 threads/block, a 256-block grid gives the scheduler enough work to cover the device for this simple streaming kernel while avoiding tens of thousands of tiny blocks. The exact best cap is workload- and occupancy-dependent, so treat 256 as a reasonable benchmark setting rather than a universal rule.
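
To see how the cap behaves across problem sizes, the block-count logic from solve can be mirrored in Python. The sketch below reuses the same three steps (ceiling division, cap, minimum of one block); the function name and the list of sample sizes are chosen for illustration.

```python
NUM_SMS = 132  # H100 SXM, as stated above


def grid_stride_blocks(n: int, threads_per_block: int = 256, max_blocks: int = 256) -> int:
    """Mirror of the host-side block count used for the grid-stride kernel."""
    vec_n = n // 4
    blocks = (vec_n + threads_per_block - 1) // threads_per_block
    blocks = min(blocks, max_blocks)  # cap the grid
    return max(blocks, 1)             # still launch one block when n < 4


for n in (3, 1000, 100_000, 25_000_000):
    blocks = grid_stride_blocks(n)
    print(n, blocks, f"{blocks / NUM_SMS:.2f} blocks per SM")
```

Note that the cap only matters for large inputs: small problems produce fewer blocks than there are SMs, so part of the GPU stays idle regardless of the cap.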

Performance Expectations

Kernel	Typical bandwidth target (H100 SXM, float32)	% of Peak (3.35 TB/s)
Naive (scalar loads)	~1.0–1.3 TB/s	~30–40%
Vectorized (float4)	~2.8–3.1 TB/s	~84–93%
Grid-stride + float4	~2.9–3.2 TB/s	~87–95%
Theoretical peak	3.35 TB/s	100%

The remaining gap depends on measurement details such as tensor size, launch overhead, compiler output, clocking, alignment, and memory-system effects. Always verify with a profiler or benchmark on the target GPU.
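
When you benchmark, the usual way to turn a kernel time into the numbers in this table is to divide the bytes moved by the elapsed time. Here is a minimal Python sketch of that conversion; the 100-microsecond timing is a made-up placeholder, not a measured result, and the function names are hypothetical.

```python
PEAK_BANDWIDTH = 3.35e12  # bytes/s, approximate H100 SXM peak from the table


def effective_bandwidth(n: int, seconds: float, bytes_per_element: int = 4) -> float:
    """Bytes per second for C = A + B: two reads and one write per element."""
    return 3 * bytes_per_element * n / seconds


def percent_of_peak(bandwidth: float, peak: float = PEAK_BANDWIDTH) -> float:
    return 100.0 * bandwidth / peak


elapsed = 100e-6  # hypothetical kernel time in seconds, for illustration only
bw = effective_bandwidth(25_000_000, elapsed)
print(f"{bw / 1e12:.2f} TB/s, {percent_of_peak(bw):.1f}% of peak")
```

In practice the elapsed time would come from CUDA events or a profiler, averaged over several runs after a warm-up launch.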

Summary

Technique	What It Does	Why It Helps
`float4` loads/stores	Each thread loads 4 floats (16 bytes) per instruction	4x fewer instructions vs scalar
`reinterpret_cast<float4*>`	Treats a `float*` as a `float4*` for 128-bit access	Makes the 128-bit access explicit so the compiler can emit `LDG.E.128`
Grid-stride loop	Each thread processes multiple chunks	Fixed grid size while preserving coalesced access
Tail handling	Scalar loop for `N % 4` remaining elements	Correctness for non-divisible N
256 threads/block	Good occupancy, multiple blocks per SM	Hides memory latency via warp switching

CuTe DSL Version

The same vector addition problem can be expressed using NVIDIA’s CuTe DSL — a Python-embedded domain-specific language from the CUTLASS library. CuTe provides higher-level abstractions (zipped_divide, TV layouts, .load()) that can generate the same kind of vectorized memory operations as the raw CUDA C++ above, but let you manipulate tensor layouts algebraically.

Why CuTe? Raw CUDA gives you full control, but as kernels grow more complex (GEMM, attention, convolution), manually managing tile shapes, thread-to-data mappings, and memory access patterns becomes error-prone. CuTe lets you express these patterns declaratively — you describe the layout, and CuTe generates the indexing math for you.

The examples below use float32 (matching the CUDA C++ section). With float32 (4 bytes per element), a 128-bit (16-byte) vectorized load fetches 4 elements per instruction.

Setup

CuTe DSL programs use PyTorch tensors as input and the cutlass Python package for kernel compilation. from_dlpack converts a PyTorch tensor into a CuTe tensor without copying data — it just wraps the same GPU memory with CuTe’s layout metadata.

import torch
import cutlass
import cutlass.cute as cute
from cutlass.cute.runtime import from_dlpack

Part 1: Naive CuTe Kernel

CuTe separates GPU code into two pieces: a kernel (@cute.kernel) that runs on the GPU, and a host function (@cute.jit) that runs on the CPU to configure and launch the kernel. This is analogous to the __global__ kernel + host solve function in CUDA C++.

@cute.kernel
def naive_add_kernel(
    gA: cute.Tensor,
    gB: cute.Tensor,
    gC: cute.Tensor,
    N: cute.Uint32,
):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    idx = bidx * bdim + tidx

    if idx < N:
        gC[idx] = gA[idx] + gB[idx]

Host Launch Function

The @cute.jit decorator marks a host-side function that configures and launches the kernel. It runs on the CPU and sets up the grid/block dimensions. The function must be named solve with the signature (A, B, C, N) to match the LeetGPU challenge interface.

@cute.jit
def solve(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor, N: cute.Uint32):
    threads_per_block = 256

    naive_add_kernel(A, B, C, N).launch(
        grid=((N + threads_per_block - 1) // threads_per_block, 1, 1),
        block=(threads_per_block, 1, 1),
    )

Running It

CuTe uses a two-step process: cute.compile JIT-compiles the kernel to GPU machine code (PTX → SASS), and then you call the compiled function. The assumed_align=16 parameter is critical — it tells the compiler that the memory pointers may be treated as 16-byte aligned. It does not realign memory for you; the input tensors must actually satisfy that alignment.

N = 25_000_000

a = torch.randn(N, device="cuda", dtype=torch.float32)
b = torch.randn(N, device="cuda", dtype=torch.float32)
c = torch.zeros(N, device="cuda", dtype=torch.float32)

a_ = from_dlpack(a, assumed_align=16)
b_ = from_dlpack(b, assumed_align=16)
c_ = from_dlpack(c, assumed_align=16)

naive_fn = cute.compile(solve, a_, b_, c_, N)
naive_fn(a_, b_, c_, N)

torch.testing.assert_close(c, a + b)

How It Works

Index calculation: Each thread gets a unique global index idx = bidx * bdim + tidx.
Bounds check: if idx < N prevents out-of-bounds access, just like if (idx < N) in the CUDA C++ version. N is passed from the host as a cute.Uint32 scalar.
1:1 mapping: Each thread processes exactly one element. This is simple but means each load is a single scalar (4 bytes for float32).
Coalesced access: Adjacent threads access adjacent elements, so warps read contiguous memory — the hardware coalesces these into efficient 128-byte transactions.

Part 2: Vectorized with `zipped_divide`

In the CUDA C++ version, we used float4 and reinterpret_cast to manually group 4 floats into a single 128-bit load. CuTe provides a higher-level abstraction for the same idea: cute.zipped_divide(tensor, tiler) partitions a tensor into fixed-size tiles. For vectorization, the tiler specifies how many contiguous elements each thread should access — 4 float32 elements = 16 bytes = one 128-bit load. The example below keeps the vectorized path and handles the last N % 4 scalar elements inside the same kernel launch.

Key Insight: Vectorized Memory Access

Instead of each thread loading one 4-byte float32, each thread can load 4 float32 values (16 bytes) in a single 128-bit instruction:

Naive:      Thread 0 loads A[0]     → 4 bytes, 1 instruction
Vectorized: Thread 0 loads A[0:4]   → 16 bytes, 1 instruction (LDG.E.128)

Benefits:

4x fewer load instructions → less instruction scheduler pressure
Better bus utilization → each instruction carries more useful data
More in-flight bytes → better latency hiding per Little’s Law

The Vectorized Kernel

@cute.kernel
def vectorized_add_kernel(
    gA: cute.Tensor,
    gB: cute.Tensor,
    gC: cute.Tensor,
    A: cute.Tensor,
    B: cute.Tensor,
    C: cute.Tensor,
    vec_N: cute.Uint32,
    N: cute.Uint32,
):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()
    bdim, _, _ = cute.arch.block_dim()

    idx = bidx * bdim + tidx

    # gA has been tiled by (4,) on the host side via zipped_divide.
    # Complete vector tiles are indexed by 0 ... floor(N/4)-1.
    # The 1st mode (4,) is the per-thread tile (4 contiguous float32 elements).
    # The 2nd mode indexes which complete tile each thread works on.
    if idx < vec_N:
        # With 16-byte alignment, .load() can compile to one 128-bit load
        a_vec = gA[(None, idx)].load()
        b_vec = gB[(None, idx)].load()

        # Elementwise add on the vector, then store (128-bit STG.E.128)
        gC[(None, idx)] = a_vec + b_vec

    # Fuse scalar cleanup into the same launch. Only block 0 needs to handle it
    # because there are at most 3 leftover float32 elements.
    tail_start = vec_N * 4
    tail_idx = tail_start + tidx
    if bidx == 0:
        if tail_idx < N:
            C[tail_idx] = A[tail_idx] + B[tail_idx]

Host-Side: Tiling with `zipped_divide`

The host function tiles each tensor before passing them to the kernel. cute.zipped_divide(A, (4,)) groups contiguous elements into 4-value tiles, and the vectorized path only consumes the first N // 4 complete tiles. The same kernel also receives the original tensors so it can handle the scalar tail without paying for a second launch. The grid is computed from N, not vec_n, so small cases like N = 1 still launch one valid block instead of crashing with a zero-sized CUDA grid.

@cute.jit
def solve(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor, N: cute.Uint32):
    threads_per_block = 256

    # Tile: each thread handles 4 contiguous float32 = 16 bytes = 128-bit
    gA = cute.zipped_divide(A, (4,))
    gB = cute.zipped_divide(B, (4,))
    gC = cute.zipped_divide(C, (4,))

    vec_n = N // 4

    # Use N for the block count so N < 4 still launches one valid block.
    blocks = (N + threads_per_block * 4 - 1) // (threads_per_block * 4)
    vectorized_add_kernel(gA, gB, gC, A, B, C, vec_n, N).launch(
        grid=(blocks, 1, 1),
        block=(threads_per_block, 1, 1),
    )
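
The reason for computing the grid from N rather than vec_n is easiest to see by comparing the two formulas side by side. This plain-Python sketch does only the integer arithmetic (no CuTe calls); both function names are invented for the comparison.

```python
def blocks_from_n(n: int, threads_per_block: int = 256, vec: int = 4) -> int:
    """Block count used above: ceiling of n / (threads_per_block * vec)."""
    per_block = threads_per_block * vec
    return (n + per_block - 1) // per_block


def blocks_from_vec_n(n: int, threads_per_block: int = 256, vec: int = 4) -> int:
    """Alternative that rounds n down to whole vectors first."""
    vec_n = n // vec
    return (vec_n + threads_per_block - 1) // threads_per_block


for n in (1, 3, 4, 1024, 1025, 25_000_000):
    print(n, blocks_from_n(n), blocks_from_vec_n(n))
```

For N = 1 through 3 the vec_n-based formula yields zero blocks, which is not a valid launch, while the N-based formula always yields at least one block and never fewer than the vectorized path needs.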

Running the Vectorized Kernel

a = torch.randn(N, device="cuda", dtype=torch.float32)
b = torch.randn(N, device="cuda", dtype=torch.float32)
c = torch.zeros(N, device="cuda", dtype=torch.float32)

a_ = from_dlpack(a, assumed_align=16)
b_ = from_dlpack(b, assumed_align=16)
c_ = from_dlpack(c, assumed_align=16)

vec_fn = cute.compile(solve, a_, b_, c_, N)
vec_fn(a_, b_, c_, N)

torch.testing.assert_close(c, a + b)

How `zipped_divide` Enables Vectorization

Let’s trace what happens step by step:

Original tensor A:   shape (N,), stride (1,)
                     e.g. (25000000,):(1,)

After zipped_divide(A, (4,)):
  shape:  ((4,), (6250000,))
  stride: ((1,), (4,))
           ~~~~  ~~~~~~~~~
            |        |
            |        └── indexes which tile (which thread works on it)
            └── per-thread tile: 4 contiguous float32 elements

When thread t accesses gA[(None, idx)].load():

idx selects which tile
None means “give me the entire tile” — all 4 elements
.load() can compile to a single 128-bit load because the 4 float32 values are contiguous and 16-byte aligned
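
The tiled layout traced above is just an affine map from a (value, tile) coordinate to an element offset. The following Python sketch models that map by hand, assuming the ((4,), (6250000,)) shape and ((1,), (4,)) strides shown in the trace; it does not call CuTe, and the function names are made up.

```python
def zipped_offset(value_idx: int, tile_idx: int) -> int:
    """Element offset for coordinate (value_idx, tile_idx) with strides (1, 4)."""
    return value_idx * 1 + tile_idx * 4


def tile_offsets(tile_idx: int) -> list[int]:
    """Offsets selected by gA[(None, tile_idx)]: all 4 values of one tile."""
    return [zipped_offset(v, tile_idx) for v in range(4)]


print(tile_offsets(0))
print(tile_offsets(5))
```

Each tile is four consecutive offsets starting at a multiple of 4 elements, that is, a multiple of 16 bytes for float32, which is what lets the load be a single 128-bit access when the base pointer is aligned.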

Part 3: Advanced — TV Layout

For production kernels, CuTe offers TV (Thread-Value) Layout — a rank-2 layout that maps (thread_index, value_index) directly to tensor coordinates. This decouples the thread-to-data mapping from the kernel logic, making it easy to experiment with different access patterns. The version below also adds a coordinate tensor, so the final partial tile is handled by predicates inside the same kernel launch.

In the previous approaches, the kernel itself computed which data each thread should access (the index math). With TV layout, that mapping is defined outside the kernel as a layout object and passed in. The kernel simply says “give me my data” — it doesn’t need to know how the data was partitioned.

@cute.kernel
def tv_add_kernel(
    mA: cute.Tensor,
    mB: cute.Tensor,
    mC: cute.Tensor,
    cC: cute.Tensor,
    shape: cute.Shape,
    tv_layout: cute.Layout,
):
    tidx, _, _ = cute.arch.thread_idx()
    bidx, _, _ = cute.arch.block_idx()

    # Slice to get this thread-block's tile
    blk_coord = (None, bidx)
    blkA = mA[blk_coord]
    blkB = mB[blk_coord]
    blkC = mC[blk_coord]
    blkCrd = cC[blk_coord]

    # Compose block-local tensor with TV layout:
    # blkA maps (TileN,) → physical address
    # tv_layout maps (tid, vid) → (TileN,)
    # composition maps (tid, vid) → physical address
    tidfrgA = cute.composition(blkA, tv_layout)
    tidfrgB = cute.composition(blkB, tv_layout)
    tidfrgC = cute.composition(blkC, tv_layout)
    tidfrgCrd = cute.composition(blkCrd, tv_layout)

    # Slice to get this thread's fragment
    thrA = tidfrgA[(tidx, None)]
    thrB = tidfrgB[(tidx, None)]
    thrC = tidfrgC[(tidx, None)]
    thrCrd = tidfrgCrd[(tidx, None)]

    # Predicate each value so the final partial tile is safe.
    for i in cutlass.range_constexpr(cute.size(thrCrd)):
        if cute.elem_less(thrCrd[i], shape):
            thrC[i] = thrA[i] + thrB[i]

TV Layout Host Code

The host constructs a TV layout that maps 256 threads to a tile, with each thread owning 16 contiguous bytes (4 float32 elements). It also constructs an identity-coordinate tensor and tiles it with the same layout; the kernel uses those coordinates for bounds checks instead of launching a separate tail kernel.

Breaking down the layout construction:

Thread layout (256,): 256 threads in a 1D arrangement.
Value layout (coalesced_ldst_bytes,): each thread reads 16 bytes. recast_layout converts the byte-level layout to element-level (for float32: 16 bytes = 4 elements).
make_layout_tv: combines thread and value layouts into a single TV layout and returns the tile shape tiler_n that zipped_divide needs.

@cute.jit
def solve(A: cute.Tensor, B: cute.Tensor, C: cute.Tensor, N: cute.Uint32):
    coalesced_ldst_bytes = 16  # 128-bit = 16 bytes
    threads_per_block = 256

    assert all(t.element_type == A.element_type for t in [A, B, C])
    dtype = A.element_type

    # Thread layout: 256 threads in 1D
    thr_layout = cute.make_ordered_layout((threads_per_block,), order=(0,))
    # Value layout: each thread reads 16 bytes, recast to element type
    val_layout = cute.make_ordered_layout((coalesced_ldst_bytes,), order=(0,))
    val_layout = cute.recast_layout(dtype.width, 8, val_layout)
    tiler_n, tv_layout = cute.make_layout_tv(thr_layout, val_layout)

    mA = cute.zipped_divide(A, tiler_n)
    mB = cute.zipped_divide(B, tiler_n)
    mC = cute.zipped_divide(C, tiler_n)

    idC = cute.make_identity_tensor(C.shape)
    cC = cute.zipped_divide(idC, tiler_n)

    elems_per_thread = coalesced_ldst_bytes * 8 // dtype.width  # 4 for float32
    elems_per_block = elems_per_thread * threads_per_block
    blocks = (N + elems_per_block - 1) // elems_per_block

    tv_add_kernel(mA, mB, mC, cC, C.shape, tv_layout).launch(
        grid=(blocks, 1, 1),
        block=(threads_per_block, 1, 1),
    )
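
To build intuition for what the kernel and host code compute together, here is a pure-Python model of the mapping. It assumes the arrangement described above (256 threads, each owning 4 contiguous float32 values, so (tid, vid) maps to tid * 4 + vid within a 1024-element tile) and replaces the coordinate-tensor predicate with a plain comparison. None of these functions are CuTe APIs.

```python
THREADS = 256
LDST_BYTES = 16  # 128-bit access per thread
ELEM_BITS = 32   # float32

VALUES = LDST_BYTES * 8 // ELEM_BITS  # elements per thread after the recast
TILE = THREADS * VALUES               # elements per block tile


def tv_to_tile_offset(tid: int, vid: int) -> int:
    """Model of the TV layout: (thread, value) -> offset inside the block tile."""
    return tid * VALUES + vid


def simulate_tv_add(n: int) -> list[int]:
    """Count writes per element, with a per-value bounds predicate."""
    writes = [0] * n
    blocks = (n + TILE - 1) // TILE
    for bidx in range(blocks):
        for tid in range(THREADS):
            for vid in range(VALUES):
                coord = bidx * TILE + tv_to_tile_offset(tid, vid)
                if coord < n:  # stands in for the coordinate predicate
                    writes[coord] += 1
    return writes


writes = simulate_tv_add(2500)
print(VALUES, TILE, all(w == 1 for w in writes))
```

Changing the access pattern in this model only means changing `tv_to_tile_offset`; the loop that stands in for the kernel stays the same, which is the decoupling the TV layout is meant to provide.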

Why TV Layout Matters

The TV layout separates what each thread does from how data is arranged in memory. You can change the access pattern (e.g., different tile shapes, different thread-to-data mappings) by modifying only the layout construction — the kernel code stays the same. The coordinate tensor adds a clean boundary predicate, so the same kernel works for arbitrary N.

The tradeoff is that value-by-value predicates may prevent the compiler from emitting the same clean 128-bit .load() / vector-store sequence shown in Part 2. In practice, this TV version is often the more robust CuTe pattern for arbitrary shapes, while the fused Part 2 version is the more direct route when the benchmark rewards explicit 128-bit vector loads.

CuTe DSL Performance

Kernel	Bandwidth (H100, float32)	% of Peak (3.35 TB/s)
Naive (scalar loads)	~1.0–1.3 TB/s	~30–40%
Vectorized (zipped_divide)	~2.8–3.1 TB/s	~84–93%
TV Layout + predicates	workload-dependent	benchmark it
Theoretical peak	3.35 TB/s	100%

CuTe DSL Summary

Technique	What It Does	Why It Helps
`zipped_divide(tensor, (4,))`	Tiles tensor so each thread gets 4 float32 elements	Enables 128-bit vectorized load/store
`.load()` on tiled slice	Can emit a single 128-bit load	4x fewer instructions vs scalar when aligned
Coordinate tensor + `cute.elem_less`	Predicates the final partial tile	Correctness for arbitrary `N` in one launch
`from_dlpack(t, assumed_align=16)`	Tells CuTe the pointer is 16-byte aligned	Required for compiler to emit vectorized instructions; the tensor must actually be aligned
`cute.composition(tensor, tv_layout)`	Maps (thread, value) → physical address	Decouples access pattern from kernel logic

What’s Next

Vector addition is memory-bound — the GPU barely computes anything. For compute-bound kernels (matrix multiply, convolution), the real optimization challenges appear: tiled layouts, shared memory staging, and warp-specialized pipelines. The patterns you learned here — float4 vectorization in raw CUDA and zipped_divide / TV layouts in CuTe — are the exact building blocks used in high-performance GEMM and Flash Attention implementations.
