DistilBERT
DistilBERT is a compact, high-efficiency transformer model: 40% smaller and 60% faster than BERT, while retaining 97% of its performance on the GLUE benchmark.
DistilBERT is a distilled version of the BERT base model, engineered for computational efficiency. Its architecture is a simplified BERT: the number of transformer layers is halved (from 12 to 6), and the token-type embeddings and the pooler are removed. The result is roughly 40% fewer parameters and about 60% faster inference than BERT. The model is trained with knowledge distillation, in which a smaller student model learns from the larger BERT teacher through a triple loss: a masked language modeling loss on the training text, a distillation loss over the teacher's softened output probabilities, and a cosine-distance loss that aligns the two models' hidden states. This training regime lets DistilBERT retain 97% of BERT's language understanding capability, making it well suited to low-latency, resource-constrained, and on-device NLP applications.