smolR1
Demonstrating a reproducible DeepSeek R1 implementation using Qwen2.5-0.5B on two 4090 GPUs, providing a compact, stable GRPO baseline for rapid RL experimentation.
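The GRPO baseline at the heart of this reproduction can be illustrated by its core idea, group-relative advantage estimation: each sampled completion's reward is normalized against the mean and standard deviation of its own sampling group, which removes the need for a learned value model. A minimal sketch (function name and the epsilon stabilizer are illustrative, not taken from the talk):

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style: normalize each sampled
    completion's reward by the mean and std of its own sampling group.
    (Illustrative sketch; `eps` guards against a zero-variance group.)"""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With a group of rewards like `[1.0, 0.0, 1.0, 0.0]`, correct completions get positive advantages and incorrect ones negative, centered around zero, so the policy gradient pushes toward the better half of the group without any critic network.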
- GPT-4: OpenAI's large multimodal model, accepting both text and image inputs and delivering human-level performance on complex professional and academic benchmarks. It marks a significant capability leap over its predecessor, scoring in the top 10% on a simulated bar exam (GPT-3.5 scored in the bottom 10%). The model handles nuanced instructions and long-form content, with context windows up to 32,768 tokens in the 32K variant, enough to process roughly 25,000 words in a single, complex prompt. GPT-4 is engineered for enhanced reliability, steerability, and advanced reasoning across diverse tasks.
- Claude-3: Anthropic's state-of-the-art multimodal model family, comprising three generative models: Opus, Sonnet, and Haiku. Opus, the flagship, excels at complex reasoning, outperforming peers on key benchmarks (MMLU, GPQA) and supporting a 200,000-token context window. Sonnet balances performance and cost for enterprise workloads, running twice as fast as its predecessor, Claude 2.1. Haiku is the fastest and most cost-effective option, able to process a 10,000-token research paper (including charts) in under three seconds. All three models feature strong vision capabilities for analyzing charts, diagrams, and PDFs alongside text, enabling advanced data extraction and analysis.
- Llama-2: Meta AI's openly accessible family of large language models (LLMs), released for free research and commercial use. The collection includes both pre-trained foundation models and instruction-tuned Chat variants, scaling from 7 billion (7B) to 70 billion (70B) parameters. Key technical upgrades over Llama 1 include training on 2 trillion tokens (40% more data) and doubling the context length to 4,096 tokens. The Llama-2-chat models were aligned using Reinforcement Learning from Human Feedback (RLHF), positioning them as a top-tier, openly available option for developers building advanced generative AI solutions.
- Hugging Face TRL: TRL (Transformer Reinforcement Learning) is a full-stack library for post-training foundation models, built directly on the Hugging Face `transformers` ecosystem. It provides a suite of dedicated trainers: `SFTTrainer` for Supervised Fine-Tuning, `RewardTrainer` for preference modeling, and `DPOTrainer` or `PPOTrainer` for core Reinforcement Learning (RL) alignment methods. The library is engineered for efficiency and scale: it integrates with `PEFT` (Parameter-Efficient Fine-Tuning) for memory-conscious training (LoRA/QLoRA) and leverages `Accelerate` to scale from a single GPU to multi-node clusters.
- vLLM: a high-throughput, memory-efficient open-source LLM inference engine built for low-latency serving. Its core innovation is PagedAttention, a memory-management technique inspired by OS virtual memory that efficiently handles the key-value (KV) cache. This optimization drastically reduces memory overhead (up to 90% in some reported cases) and allows continuous batching of requests. The result is significantly higher request capacity on the same hardware, lower GPU usage, and a production-ready, cost-effective serving system that supports popular models like Llama and Mistral, complete with an OpenAI-compatible API server.
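The PagedAttention idea borrowed from OS virtual memory can be sketched as a toy block table: logical token positions map through a per-sequence table to small fixed-size physical blocks, allocated on demand and returned to a free pool when a sequence finishes. The class and method names below are illustrative, not vLLM's actual API:

```python
class PagedKVCache:
    """Toy model of PagedAttention-style KV-cache paging (not vLLM's API):
    each sequence owns a block table mapping its logical blocks to physical
    blocks, so memory is claimed on demand rather than reserved up front."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # pool of physical block ids
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lens = {}                        # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve cache space for one more token of this sequence."""
        n = self.lens.get(seq_id, 0)
        if n % self.block_size == 0:          # last block full, or no block yet
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lens[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        """Translate a logical token position into a physical cache slot."""
        block = self.tables[seq_id][pos // self.block_size]
        return block * self.block_size + pos % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lens.pop(seq_id, None)
```

Because blocks are small and shared from one pool, memory waste is bounded by less than one block per sequence, which is what lets the real engine pack many more concurrent requests onto the same GPU than a reserve-the-maximum allocator would.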
Related projects
AI Computer
Berlin
Learn how to build a desktop PC with an RTX 3090 for local AI workloads, covering hardware assembly, software…
Aura: A Locally Hosted AI Gaming Companion
DC
Learn how to combine local screen capture, vision, speech recognition, LLM, and text‑to‑speech into a real-time, offline, user‑controlled…
Building a Sovereign Multi-GPU AI Infrastructure in a European Data Center (in Less Than One Year)
Cologne
How a startup built a sovereign multi‑GPU AI platform in under a year, using Kubernetes, Ray actors, MongoDB,…
Deep RL for User Experience
Chicago
Learn how to use Ray’s distributed tuning and parallel processing to scale reinforcement learning predictions and training, including…
Going from 0 to 1 with the help of AI and RL
New York City
Learn how to generate synthetic user personas with Anthropic Claude, conduct AI‑driven interviews, and refine startup hypotheses using…
Teaching small language models a thing or two
Amsterdam
Learn about finetuning small language models for specific tasks using limited data, exploring how this approach can efficiently…