Sparse Autoencoders
Sparse Autoencoders (SAEs) enforce a sparsity constraint on the hidden layer to learn disentangled, interpretable features from high-dimensional, unlabeled data.
Sparse Autoencoders (SAEs) are a variant of the standard autoencoder. The core mechanism is a sparsity constraint: only a small fraction of hidden-layer neurons activate for any given input, forcing the network to isolate the most informative features. This is enforced by adding a penalty term, such as L1 regularization or a KL-divergence term, to the reconstruction loss. The result is a typically *overcomplete* dictionary in which each input is represented by only a few active, largely *disentangled* features, a property that is critical for model interpretability. Research teams at Anthropic and OpenAI now use SAEs as a primary tool to decompose the internal representations of large language models (LLMs) such as Claude 3 and GPT-4, breaking complex, polysemantic neurons into crisp, human-understandable features.
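The mechanism above can be sketched in a few lines. This is a minimal, illustrative NumPy implementation, not any lab's production code: the dimensions, initialization, and `l1_coeff` value are all assumptions chosen for readability, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: d_in input activations, d_hidden dictionary size.
d_in, d_hidden = 16, 64          # overcomplete: d_hidden > d_in
l1_coeff = 1e-3                  # sparsity penalty weight (illustrative choice)

# SAE parameters, randomly initialized for illustration (untrained).
W_enc = rng.normal(0, 0.1, (d_in, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(0, 0.1, (d_hidden, d_in))
b_dec = np.zeros(d_in)

def sae_forward(x):
    """Encode with a ReLU (nonnegative feature activations), then linearly decode."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)   # feature activations
    x_hat = f @ W_dec + b_dec                # reconstruction of the input
    return f, x_hat

def sae_loss(x):
    """Reconstruction MSE plus an L1 penalty that pushes activations toward zero."""
    f, x_hat = sae_forward(x)
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.mean(np.abs(f).sum(axis=-1))
    return recon + sparsity

x = rng.normal(size=(8, d_in))   # a batch of stand-in model activations
loss = sae_loss(x)
```

Training would minimize `sae_loss` by gradient descent; the L1 term trades reconstruction fidelity for sparsity, which is why the penalty weight is a key hyperparameter.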