📚Modern CUDA Learn Notes: 200+ Tensor/CUDA Cores Kernels, FA2, HGEMM via MMA and CuTe (~99% TFLOPS of cuBLAS/FA2 🎉).
🚀🚀🚀 This repository lists awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PTX, and High-Performance Computing (HPC) projects.
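The headline HGEMM kernels in these notes are built on warp-level MMA. As a point of reference, here is a minimal sketch of the Tensor Core (WMMA) pattern such kernels start from: a single 16x16x16 half-precision tile computed by one warp. The kernel name, shapes, and launch setup are illustrative, not taken from any listed repo.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile: D = A * B (half inputs, float accum).
// Real HGEMM kernels tile over shared memory with many warps per block.
__global__ void wmma_tile_hgemm(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);             // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
// Launch with one warp: wmma_tile_hgemm<<<1, 32>>>(dA, dB, dD);
```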
Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
GEMM and Winograd-based convolutions implemented with CUTLASS.
A study of CUTLASS.
Multiple GEMM operators built with CUTLASS to support LLM inference.
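For CUTLASS-based GEMM projects like the ones above, the usual entry point is the device-level GEMM template. A minimal sketch of that usage, mirroring CUTLASS's basic single-precision, column-major example; the listed repos instantiate tensor-op half/int8 variants of the same template, and the function name here is illustrative:

```cpp
#include <cutlass/gemm/device/gemm.h>

// Plain SIMT float GEMM: D = alpha * A @ B + beta * C, column-major.
using ColumnMajor = cutlass::layout::ColumnMajor;
using Gemm = cutlass::gemm::device::Gemm<float, ColumnMajor,   // A
                                         float, ColumnMajor,   // B
                                         float, ColumnMajor>;  // C / D

cutlass::Status run_gemm(int M, int N, int K,
                         float const* A, int lda,
                         float const* B, int ldb,
                         float* C, int ldc,
                         float alpha, float beta) {
    Gemm gemm_op;
    // Arguments: problem size, refs to A/B/C, ref to D (in-place here),
    // then the alpha/beta scalars of the linear-combination epilogue.
    return gemm_op({{M, N, K},
                    {A, lda}, {B, ldb}, {C, ldc}, {C, ldc},
                    {alpha, beta}});
}
```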
A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on a Jetson Orin NX 8GB with TensorRT 8.5.2.
A PyTorch implementation of block-sparse operations.
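Block-sparse kernels like the project above typically store the matrix in a block-compressed (BSR) layout and do dense math inside each nonzero block. A naive CUDA sketch of that idea; the layout, names, and block size are hypothetical, not taken from the repository:

```cuda
// Block-CSR SpMM sketch: Y = A * X, with A stored as dense BS x BS blocks.
// row_ptr/col_idx index blocks, not individual scalars.
constexpr int BS = 16;

__global__ void bsr_spmm(const int* __restrict__ row_ptr,   // [num_block_rows + 1]
                         const int* __restrict__ col_idx,   // [num_blocks]
                         const float* __restrict__ blocks,  // [num_blocks, BS, BS]
                         const float* __restrict__ X,       // [K, N] row-major
                         float* __restrict__ Y,             // [M, N] row-major
                         int N) {
    int block_row = blockIdx.x;                  // one CTA per output tile
    int row = block_row * BS + threadIdx.y;      // output row
    int col = blockIdx.y * BS + threadIdx.x;     // output column

    float acc = 0.0f;
    // Walk the nonzero blocks of this block-row; each contributes a dense
    // BS x BS product against the matching rows of X.
    for (int b = row_ptr[block_row]; b < row_ptr[block_row + 1]; ++b) {
        const float* blk = blocks + (size_t)b * BS * BS;
        int k0 = col_idx[b] * BS;
        for (int k = 0; k < BS; ++k)
            acc += blk[threadIdx.y * BS + k] * X[(size_t)(k0 + k) * N + col];
    }
    Y[(size_t)row * N + col] = acc;
}
// Launch: grid = dim3(M / BS, N / BS), block = dim3(BS, BS);
// assumes M and N are multiples of BS.
```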