Random Samples Catalog

Complete catalog of all Random Samples episodes, our weekly AI research seminar series.

26 episodes

Date Talk Speaker
May 15, 2026 Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty
Abstract

Reinforcement learning (RL) improves language models' reasoning on difficult QA tasks, but common binary correctness rewards can encourage guessing, harm calibration, and increase hallucinations. We introduce RLCR (Reinforcement Learning with Calibration Rewards), which trains models to output both answers and confidence estimates by adding a Brier-score calibration reward to standard correctness rewards. Results show that existing reasoning training methods can be straightforwardly modified to additionally optimize for calibration, and that this in turn improves their accuracy, robustness, and scalability.

Isha Puri
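
As a rough illustration of the reward described in the RLCR abstract above, the sketch below combines a binary correctness term with a Brier-score penalty on the model's stated confidence. The function name `rlcr_reward` and the exact combination (correctness minus squared error) are assumptions made for illustration; the talk may weight or shape the terms differently.

```python
def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Hypothetical reward: binary correctness minus a Brier penalty.

    `confidence` is the probability the model assigns to its own answer
    being correct. The Brier penalty is the squared gap between that
    probability and the actual 0/1 outcome.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    outcome = 1.0 if is_correct else 0.0
    brier_penalty = (confidence - outcome) ** 2
    return outcome - brier_penalty


if __name__ == "__main__":
    # A confident wrong answer is punished more than a hesitant wrong one.
    print(rlcr_reward(False, 0.9))
    print(rlcr_reward(False, 0.1))
```

Under this form, a wrong answer given with high confidence scores lower than a wrong answer given with low confidence, so pure guessing is no longer free.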
May 01, 2026 Agent Factories for High-Level Synthesis
Abstract

We study whether general-purpose AI coding agents, with no hardware-specific training, can optimize chip designs. Our method, HLS Factory, builds and coordinates multiple autonomous agents that iteratively modify code, compile it to hardware circuits, and learn from the results. As we scale the number of agents, speedups increase across twelve benchmarks, with the best results on harder problems where agents discover algorithmic restructurings that traditional search methods cannot reach. In our best case, one agent out of ten discovered an algorithmic restructuring that cut latency 11x and area 32x, a design no parameter search or single agent would have found.

Abhishek Bhandwaldar, Akash Srivastava
Apr 17, 2026 Towards Self-Adapting Models
Abstract

LLMs are powerful but static, lacking the ability to update their weights for new tasks or knowledge. We introduce Self-Adapting LLMs (SEAL), a framework where models generate their own finetuning data and update instructions. Given an input, the model produces a self-edit that may restructure information, set optimization parameters, or use tools for augmentation and training. These edits enable persistent weight updates via supervised finetuning. An RL loop trains the model to create effective self-edits using downstream performance as the reward. Unlike prior methods that rely on external modules, SEAL allows models to control their own adaptation, showing promising results in knowledge integration and few-shot generalization.

Jyo Pari
Apr 10, 2026 Self-Distillation as a New Framework for Continual Learning
Abstract

Catastrophic forgetting remains a central challenge in continual learning and the practical fine-tuning of large models, yet even basic questions about why different training algorithms forget at different rates remain unclear. In this talk, we present our discovery that when fine-tuning large models on a new task, on-policy learning consistently forgets far less than supervised fine-tuning, even when both methods reach similar performance. Building upon this insight, we present Self-Distillation, a new family of algorithms for LLM fine-tuning. Self-Distillation leverages in-context learning by using a demonstration-conditioned model as its own teacher, generating on-policy training signals that preserve prior capabilities while acquiring new skills.

Idan Shenfeld
Apr 03, 2026 Reasoning with Sampling: Your Base Model is Smarter Than You Think
Abstract

Frontier reasoning models have exhibited incredible capabilities across a wide array of disciplines, driven by posttraining large language models with reinforcement learning. However, despite the widespread success of this paradigm, much of the literature has been devoted to disentangling truly novel behaviors that emerge during RL but are not present in the base models. In our work, we approach this question from a different angle, instead asking whether comparable reasoning capabilities can be elicited from base models at inference time by pure sampling, without any additional training. Inspired by Markov chain Monte Carlo (MCMC) techniques for sampling from sharpened distributions, we propose a simple iterative sampling algorithm leveraging the base models' own likelihoods. Over different base models, we show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL on a wide variety of single-shot tasks, including MATH500, HumanEval, and GPQA.

Aayush Karan
Mar 27, 2026 Advancing Real-World Long-Horizon Tool-Using LLM Agents
Abstract

Recent progress in LLM agents has made tool use practical, but two challenges remain: evaluating agents in realistic environments and scaling them to long-horizon tasks with finite context. This presentation highlights two recent works that address these issues. MCP-Bench provides a realistic evaluation framework using 28 live MCP servers and 250 tools, testing schema understanding, multi-step planning, grounding, and cross-domain orchestration, and exposing major weaknesses in current frontier models. Memex(RL) introduces an indexed memory framework that keeps a compact summary in context while storing full past interactions externally for retrieval. Trained with reinforcement learning, it helps agents retain critical evidence, improve long-horizon performance, and reduce context usage.

Zhenting Wang
Mar 20, 2026 Nondeterminism in LLM Inference & Training-Rollout Mismatch
Abstract

LLM generation is not deterministic even when the temperature is set to zero. This issue is more pronounced in RL, where the training and rollout engines naturally operate with different batch sizes, kernel selections, and parallelization strategies. This training-rollout mismatch problem leads to suboptimal performance and training collapse, especially for MoE models. In this talk, we analyze why this happens and how to solve the problem from the system level by building deterministic GPU kernels.

Xinheng Ding
Mar 13, 2026 Evolutionary Arms Races Between LLMs in Core War
Abstract

Large language models are increasingly being used to evolve solutions to problems, inspired by biological evolution. Unlike biology, most LLM-evolution frameworks are static optimization problems, overlooking the open-ended adversarial dynamics of real-world evolutionary processes. Here, we study Digital Red Queen (DRQ), a simple self-play algorithm that embraces these Red Queen dynamics via continual adaptation to a changing objective. DRQ uses an LLM to evolve assembly-like programs, called warriors, which compete for control of a virtual machine in Core War, a Turing-complete environment studied in artificial life and connected to cybersecurity. Over many rounds, we observe that warriors become increasingly general. Interestingly, warriors also become less behaviorally diverse across independent runs, indicating a convergence pressure toward a general-purpose behavioral strategy, much like convergent evolution in nature.

Akarsh Kumar
Mar 06, 2026 Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
Abstract

This talk presents Agentic Context Engineering (ACE), an upcoming ICLR 2026 paper. ACE reframes context adaptation as a new medium of continual learning without weight updates, where agents evolve their own playbooks over time. Instead of repeatedly rewriting prompts or memories, which often leads to context collapse, ACE uses a structured generate-reflect-curate loop to incrementally accumulate, refine, and organize strategies based on agent execution feedback. This enables self-evolving agents that improve online, even without labeled supervision, while remaining efficient and stable under long contexts.

Qizheng Zhang
Feb 27, 2026 Few-Shot Diffusion Language Models
Abstract

Diffusion large language models (DLLMs) can generate text quickly by predicting many tokens at once. However, they usually need many refinement steps to produce high-quality results, which reduces their speed advantage. In this talk, we introduce T3D, a framework that helps DLLMs work well with only a few steps. It is based on two main ideas: (1) trajectory self-distillation, where a smaller model learns directly from the full step-by-step outputs of a stronger teacher model, reducing the mismatch between training and actual use, and (2) direct discriminative optimization, a training method that pushes the model to choose clearer, more confident predictions. T3D outperforms other few-step methods on reasoning and code generation tasks, significantly closing the gap with full-step diffusion models.

Tunyu Zhang
Oct 03, 2025 Instance-Adaptive Inference-Time Scaling with Calibrated Process Reward Models
Abstract

Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for LLMs. However, even state-of-the-art PRMs can be poorly calibrated. To address this, we present a calibration approach that adjusts PRM outputs to better align with true success probabilities. We introduce an instance-adaptive scaling framework that dynamically adjusts the inference budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Experiments on math reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective adaptive scaling, and (iii) the proposed IAS strategy reduces inference costs while maintaining final accuracy.

Young Jin Park
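
To make the idea of instance-adaptive budgets in the abstract above concrete, here is a minimal sketch that turns a calibrated success probability into a sample count. It assumes samples succeed independently with probability `p_success`, and it picks the smallest budget whose chance of at least one success reaches `target`. The function name, the independence assumption, and the default values are all hypothetical and are not taken from the talk.

```python
import math


def adaptive_budget(p_success: float, target: float = 0.95,
                    max_samples: int = 64) -> int:
    """Hypothetical rule: smallest N with 1 - (1 - p)^N >= target.

    `p_success` stands in for a calibrated PRM estimate that a single
    sampled trajectory reaches a correct final answer.
    """
    if not 0.0 <= p_success <= 1.0:
        raise ValueError("p_success must lie in [0, 1]")
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    if p_success == 0.0:
        return max_samples  # no finite budget helps; spend the cap
    if p_success == 1.0:
        return 1
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_success))
    return max(1, min(n, max_samples))


if __name__ == "__main__":
    for p in (0.9, 0.5, 0.01):
        print(p, adaptive_budget(p))
```

The rule only makes sense if the probabilities are well calibrated: an overconfident estimate would cut the budget too early, which is consistent with the abstract's point that calibration is needed for effective adaptive scaling.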
Aug 08, 2025 Hopscotch: Discovering and Skipping Redundancies in Language Models
Abstract

Join us for a presentation and discussion on Hopscotch, a method produced by Red Hat's AI Innovation team aimed at understanding and reducing redundancy in language models. With Hopscotch, we can skip entire attention blocks within a model, offering improved inference speeds and memory savings with minimal quality drop-off. This coarse-grained method also provides insights into the frequency of task-specific redundancies within language models, and just how inefficient large models may be today.

Mustafa Eyceoz
Jul 18, 2025 Grounding Feedback is All You Need: Aligning Small Vision-Language Models
Abstract

While recent vision-language models (VLMs) excel at integrating visual and linguistic information, their performance hinges on vast quantities of curated image-text pairs. This reliance makes the alignment process both time-consuming and resource-intensive. In this talk, I will introduce Sampling-based Vision Projection (SVP), a novel framework that improves vision-language alignment using automated feedback and minimal human supervision. Our results show that SVP significantly enhances image captioning, improves object recall, and reduces hallucination, enabling smaller models to match the performance of much larger systems.

Giorgio Giannone
Jul 11, 2025 On scalable RL in the era of agentic LLMs
Abstract

As AI progresses beyond individual models toward tool-using, multi-agent systems, a new frontier is emerging, one where language models act over time, coordinate, and interface with real-world environments. These agentic systems promise to automate complex tasks, but realizing their full potential will require more than just scale. This talk will focus on the broader challenge of optimizing interactive AI behaviors, and the limitations of conventional supervised fine-tuning in such settings. I will introduce async-grpo, a novel high-performance reinforcement learning library purpose-built for training language models.

Aldo Pareja
Jun 27, 2025 LLM Meets Cache: From Application to Architecture
Abstract

Large Language Models are powerful yet resource-intensive systems that excel across numerous domains, including autonomous agents, complex reasoning, and content generation. However, their computational demands present significant challenges for practical deployment and scalability. Caching serves as a critical optimization strategy for reusing computational results and reducing redundant processing. In this presentation, we will explore cache design from two key perspectives: application-layer agent caching for efficient retrieval and response generation, and model-layer architectural caching including quantized KV cache implementations for memory efficiency.

Shuhang Lin
Jun 13, 2025 Accelerating LLM Knowledge Learning and Unlearning Research via Unified Frameworks
Abstract

General-purpose LLMs may struggle to answer knowledge-intensive questions grounded in specialized document collections. The first part of the presentation will discuss recent literature on injecting specialized knowledge into LLM parameters and propose an extensible framework for knowledge acquisition methods. The second part will cover OpenUnlearning, an extensible framework designed to benchmark both unlearning methods and evaluation metrics for LLMs, integrating 9 unlearning algorithms, 16 evaluation methods, and 450+ model checkpoints.

Wenlong Zhao
May 09, 2025 Towards Combinatorial Interpretability of Neural Computation
Abstract

This session will introduce a groundbreaking combinatorial approach to neural network interpretability, based on research from Nir Shavit and Micah Adler at MIT CSAIL, and Dan Alistarh at IST Austria. The approach focuses on the relationships within the network's learned weights and biases, offering a new way to decode how neural networks compute logic. We'll explore the Feature Channel Coding Hypothesis, which reveals how neural networks compute Boolean expressions by mapping input features to combinations of neurons, effectively forming codes for each feature.

Nir Shavit
May 02, 2025 Synthetic Data Generation via SDG-Hub
Abstract

In this talk, we will introduce SDG Hub, an open-source toolkit developed at Red Hat for customizing language models using synthetic data. We will begin by unpacking what synthetic data means in the context of LLMs, and how it enables model customization. The session will explore SDG Hub's core components: prompts, blocks, and flows, and demonstrate how users can compose, extend, or modify pipelines to fit specific tasks.

Abhishek Bhandwaldar, Shivchander Sudalairaj, Akash Srivastava
Apr 25, 2025 Continual Post-Training
Abstract

As large language models transition from static systems to dynamic components in real-world applications, a major challenge emerges: how can we teach them new tasks without making them forget what they've already learned? In this talk, we'll introduce a practical and theoretically grounded method for post-training continual learning that enables full-model fine-tuning, without increasing model size or compromising general capabilities. The key insight lies in constraining updates to carefully selected low-rank subspaces.

Nikhil Nayak, Krishnateja Killamsetty
Apr 18, 2025 State of LLM Compression from Research to Production
Abstract

LLMs have owned the stage, but with size comes complexity. This talk explores the evolving landscape of LLM Compression, from the latest SOTA research to real-world deployments. We'll break down the high-level effects of techniques such as quantization and sparsity and their tradeoffs between accuracy and performance. Additionally, we'll walk through the differences between academic and real-world benchmarks, what's ready for production today, what's sitting in the research lab, and what it will take to close the gap.

Mark Kurtz
Apr 11, 2025 Activation-Informed Merging of Large Language Models
Abstract

Join us for a talk on merging LLM experts, combining the parameters and embeddings of multiple fine-tuned LLMs to enhance performance across various tasks while maintaining computational efficiency. We introduce Activation-Informed Merging (AIM), a technique that integrates activation space information into the merging process to improve performance and robustness of merging methods. AIM is a flexible, complementary solution applicable to any existing merging method. Motivated by continual learning and model compression principles, it is designed to preserve salient weights from pre-training to enhance merging outcomes, with empirical evidence showing up to a 40% increase in benchmark performance.

Amin Heyrani Nobari & Kaveh Alim
Apr 04, 2025 Scaling Inference Time Scaling: Subspace-orthogonal KV Cache Quantization
Abstract

Modern reasoning models often require long responses to think through problems before arriving at a final answer. However, this incurs high latency and significant GPU memory demands. We introduce a novel KV cache quantization algorithm: Subspace-orthogonal KV cache quantization (SQuat). Our method is training-free, requires no calibration data, runs on-the-fly, and is grounded in a theoretical framework we developed. Empirically, it reduces GPU peak memory by 2.17x to 2.82x, improves throughput by 2.45x to 3.60x, and outperforms existing training-free KV cache quantization methods.

Hao Wang, Ligong Han
Feb 07, 2025 Probabilistic Inference Approach to Inference-Time Scaling of LLMs
Abstract

Large language models have achieved significant performance gains via scaling up model sizes and data. However, recent evidence suggests diminishing returns from such approaches, motivating scaling the computation spent at inference time. Existing inference-time scaling methods, usually with reward models, cast the task as a search problem, which tends to be vulnerable to reward hacking. In this paper, we instead cast inference-time scaling as a probabilistic inference task and leverage sampling-based techniques to explore the typical set of the state distribution. We propose a novel inference-time scaling approach by adapting particle-based Monte Carlo methods to this task. Our empirical evaluation demonstrates that our methods have a 4 to 16x better scaling rate over deterministic search counterparts on various challenging mathematical reasoning tasks. Using our approach, Qwen2.5-Math-1.5B-Instruct can surpass GPT-4o accuracy in only 4 rollouts, while Qwen2.5-Math-7B-Instruct scales to o1 level accuracy in only 32 rollouts.

Isha Puri
Jan 31, 2025 Practitioner's Guide to Instruction Tuning
Abstract

The rise of large language models has created a significant disparity: industrial labs can effectively fine-tune LLMs, while individual developers and small organizations face barriers due to limited resources. We present a comprehensive study on supervised fine-tuning of LLMs using instruction-tuning datasets spanning diverse knowledge domains and skills, focusing on small-sized LLMs (3B to 7B parameters). Key insights include: larger batch sizes paired with lower learning rates lead to improved performance; early-stage training dynamics such as lower gradient norms and higher loss values are strong indicators of better final performance, enabling early termination of sub-optimal runs; certain simplifications in hyperparameters like warmup steps and learning rate schedules do not compromise performance; and no significant difference was observed between phased and stacked training strategies, but stacked training is simpler and more sample efficient.

Aldo Pareja
Jan 17, 2025 Recent Advances in Bayesian Uncertainty Estimation for LLMs
Abstract

Recent advances in uncertainty estimation for Large Language Models during downstream adaptation have addressed key challenges of reliability and simplicity. However, existing Bayesian methods typically require multiple sampling iterations during inference, creating significant efficiency issues. We investigate eliminating the need for test-time sampling by distilling aligned confidence from a Bayesian LLM into a non-Bayesian student LLM, minimizing the divergence between their predictive distributions. This simple yet effective approach achieves N-times more efficient uncertainty estimation during testing, where N is the number of samples traditionally required. Our extensive experiments demonstrate that uncertainty estimation capabilities on training data can successfully generalize to unseen test data through our distillation technique, consistently producing results comparable to or better than state-of-the-art Bayesian LLMs.

Haizhou Shi
Jan 10, 2025 Scaling Preference Alignment: Overcome Human Data Bottlenecks with Density Ratios from Open-source LLMs
Abstract

Preference tuning relies on high-quality human preference data, which is often expensive and time-consuming to gather. We introduce DrSow (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation by leveraging off-the-shelf LLMs for preference data annotation. DrSow uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We evaluate DrSow across 221 different LLM pairs and empirically find a strong correlation between the performance gap of the paired models and the quality of the reward signal. We also introduce an end-to-end pipeline that customizes reward functions based on user query domains. With a pair of Mistral-7B models, DrSow achieves a RewardBench score of 82.6, outperforming the best trained reward functions from the same model class. Further, we preference-tune Llama-3-8B-Instruct using data annotated by DrSow, pushing it to achieve a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on length-controlled AlpacaEval 2.0.

GX Xu
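
The log-density ratio reward described in the DrSow abstract above can be sketched in a few lines. The example assumes per-token log-probabilities for the same response have already been obtained from the better-aligned ("strong") and less-aligned ("weak") models. The function names and the tie-breaking rule are hypothetical and only illustrate the idea.

```python
from typing import Sequence


def density_ratio_reward(strong_logprobs: Sequence[float],
                         weak_logprobs: Sequence[float]) -> float:
    """log p_strong(y | x) - log p_weak(y | x) for one response y.

    Each sequence holds the per-token log-probabilities that one model
    assigns to the same response tokens.
    """
    if len(strong_logprobs) != len(weak_logprobs):
        raise ValueError("both models must score the same tokens")
    return sum(strong_logprobs) - sum(weak_logprobs)


def label_preference(reward_a: float, reward_b: float) -> str:
    """Mark the response with the higher reward as preferred (ties go to 'a')."""
    return "a" if reward_a >= reward_b else "b"


if __name__ == "__main__":
    r_a = density_ratio_reward([-1.0, -2.0], [-1.5, -2.5])
    r_b = density_ratio_reward([-3.0, -1.0], [-2.0, -1.0])
    print(label_preference(r_a, r_b))
```

Pairs labeled this way could then feed a standard preference-tuning pipeline in place of human annotations.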