Naive matrix multiply: C = A * B. Each thread computes one element of C:

    C[row, col] = sum_k A[row, k] * B[k, col]

2D indexing: derive the global row and col from the block and thread indices (blockIdx.y, ...
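As a sketch, the per-element computation described above can be written as a plain-C reference loop; the function name and row-major layout are illustrative assumptions, and each (row, col) iteration corresponds to the work one GPU thread would do:

```c
#include <stddef.h>

/* Naive reference: C = A * B for row-major MxK (A) and KxN (B) matrices.
 * Each (row, col) pair computes one output element as a length-K dot
 * product, mirroring the one-thread-per-element GPU mapping. */
static void matmul_naive(const float *A, const float *B, float *C,
                         size_t M, size_t K, size_t N) {
    for (size_t row = 0; row < M; ++row) {
        for (size_t col = 0; col < N; ++col) {
            float sum = 0.0f;
            for (size_t k = 0; k < K; ++k)
                sum += A[row * K + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }
}
```

In the kernel version, the two outer loops disappear: each thread derives its own (row, col) from block and thread indices and runs only the inner k loop.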
Abstract: Across a variety of ML applications, generalized matrix multiply (GEMM), whose inner loop is a dot product, is the most computationally intensive operation. This paper presents a microarchitecture exploration of ...
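A one-line cost model makes the "most computationally intensive" claim concrete: each of the M*N outputs of a GEMM is a length-K dot product (K multiplies plus K adds), so the total work is 2*M*N*K floating-point operations. This helper is an illustrative sketch, not code from the paper:

```c
#include <stdint.h>

/* Illustrative GEMM cost model (not from the paper): each of the
 * M*N output elements is a length-K dot product, i.e. K multiplies
 * and K adds, for a total of 2*M*N*K FLOPs. */
static int64_t gemm_flops(int64_t M, int64_t N, int64_t K) {
    return 2 * M * N * K;
}
```

For example, a single 4096 x 4096 x 4096 GEMM already requires about 1.4e11 floating-point operations, which is why GEMM dominates the runtime of transformer-style workloads.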
This is a fork of llama.cpp with a custom ggml backend that offloads matrix multiplication to the AMD XDNA2 NPU found in Ryzen AI MAX processors (e.g. Ryzen AI MAX 385). The NPU backend accelerates ...
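One design question such a backend must answer is which mul_mat nodes to offload. The gate below is a hypothetical sketch under made-up thresholds, not the real ggml backend interface: small matrices stay on the CPU because host-to-NPU transfer overhead would dominate the compute savings.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical offload gate (NOT the actual ggml backend API):
 * decide whether a matmul of shape MxK * KxN is worth sending to
 * the NPU. The FLOP threshold is a made-up placeholder; a real
 * backend would tune it against measured transfer and launch cost. */
static bool npu_should_offload(int64_t M, int64_t N, int64_t K) {
    const int64_t min_flops = INT64_C(1) << 24;  /* placeholder threshold */
    return 2 * M * N * K >= min_flops;           /* offload only large GEMMs */
}
```

A gate like this lets the fork fall back transparently to the stock CPU path for shapes where offloading would be a net loss.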