
cuBLASLt

cuBLASLt is NVIDIA's lightweight, high-performance library dedicated to General Matrix-to-matrix Multiply (GEMM) operations.

cuBLASLt (cuBLAS Light) delivers maximum throughput for critical deep learning and HPC workloads by focusing exclusively on advanced GEMM operations (Level 3 BLAS). It provides a flexible, multi-stage API that lets developers programmatically select optimal algorithms and heuristics for specific GPU architectures, for example to leverage Tensor Cores on NVIDIA A100 or H100 GPUs. The library is engineered for low-latency kernel launches and supports mixed-precision compute (FP16, BF16, TF32, INT8), often employing kernel fusion to combine multiple operations and minimize overhead. This targeted optimization makes it a go-to tool for accelerating large-scale AI training and inference workloads.
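The multi-stage flow described above (operation and layout descriptors, heuristic algorithm selection, then the matmul call) can be sketched roughly as follows against the public cuBLASLt C API. This is an illustrative outline, not a complete program: error checking, device memory allocation, and handle setup are elided, and the single-precision types and attribute choices here are assumptions for the sake of the example.

```cpp
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdint>

// Sketch: single-precision GEMM D = alpha*A*B + beta*C via cuBLASLt.
// Assumes A, B, C, D are device pointers to column-major m x k, k x n,
// m x n, and m x n matrices allocated elsewhere.
void lt_gemm_sketch(cublasLtHandle_t handle,
                    int64_t m, int64_t n, int64_t k,
                    const float *A, const float *B, const float *C, float *D,
                    void *workspace, size_t workspaceSize,
                    cudaStream_t stream) {
    float alpha = 1.0f, beta = 0.0f;

    // Stage 1: describe the operation (compute type, scale type).
    cublasLtMatmulDesc_t opDesc;
    cublasLtMatmulDescCreate(&opDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Stage 2: describe each matrix layout (type, rows, cols, leading dim).
    cublasLtMatrixLayout_t aDesc, bDesc, cDesc;
    cublasLtMatrixLayoutCreate(&aDesc, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&bDesc, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&cDesc, CUDA_R_32F, m, n, m);

    // Stage 3: ask the heuristics for a suitable algorithm, bounded by
    // the workspace we are willing to provide.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
        &workspaceSize, sizeof(workspaceSize));

    cublasLtMatmulHeuristicResult_t heuristic;
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(handle, opDesc, aDesc, bDesc, cDesc,
                                   cDesc, pref, 1, &heuristic, &returned);

    // Stage 4: run the matmul with the selected algorithm
    // (C and D share a layout here, so cDesc is reused for both).
    if (returned > 0) {
        cublasLtMatmul(handle, opDesc, &alpha, A, aDesc, B, bDesc,
                       &beta, C, cDesc, D, cDesc, &heuristic.algo,
                       workspace, workspaceSize, stream);
    }

    cublasLtMatmulPreferenceDestroy(pref);
    cublasLtMatrixLayoutDestroy(cDesc);
    cublasLtMatrixLayoutDestroy(bDesc);
    cublasLtMatrixLayoutDestroy(aDesc);
    cublasLtMatmulDescDestroy(opDesc);
}
```

The separation of stages is the point of the design: descriptors and the chosen algorithm can be created once and reused across many low-latency matmul launches, and swapping the data and compute types in stages 1-2 is how mixed-precision paths (FP16, BF16, TF32, INT8) are selected.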

https://docs.nvidia.com/cuda/cublas/index.html