The conventional view treats pretraining as yielding a single parameter vector, a mere starting point for subsequent adaptation. This perspective, however, overlooks a critical characteristic of large, well-pretrained models. According to research by Yulu Gan and Phillip Isola, published on arXiv, the outcome of pretraining is better understood as a distribution of parameters containing a rich landscape of task-specific experts.
The Expert Density Revelation in Large Models
While smaller models occupy a sparse regime in which experts are hard to find without structured optimization such as gradient descent, the researchers observed a notable shift in large, well-pretrained models: the density of task experts increases dramatically. Diverse, task-improving specialists are not anomalies but populate a substantial fraction of the neighborhood around the pretrained weights, making them discoverable by far simpler means.
Random Sampling: A Surprisingly Effective Post-Training Strategy
Motivated by this insight, the authors explored a straightforward, fully parallel post-training method: sample $N$ random parameter perturbations around the pretrained weights, select the top $K$ by task performance, and ensemble their predictions via majority vote. Despite its simplicity, this technique is competitive with established post-training methods such as PPO, GRPO, and evolution strategies (ES) on contemporary large-scale models. This suggests a powerful avenue for leveraging pretrained models without the computational overhead of traditional fine-tuning or reinforcement learning.
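The sample-select-ensemble loop can be sketched in a few lines. Everything below is an illustrative assumption, not the paper's actual setup: a toy linear classifier stands in for the pretrained model, Gaussian noise with scale `sigma` stands in for the perturbation distribution, and held-out accuracy stands in for the task score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "pretrained" model: a linear classifier on synthetic 2-class data.
# (Illustrative stand-in for a large pretrained network.)
X = rng.normal(size=(200, 8))
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(int)
w_pretrained = true_w + rng.normal(scale=0.5, size=8)  # imperfect starting point

def accuracy(w):
    """Task score for a candidate parameter vector (assumed scoring rule)."""
    return np.mean((X @ w > 0).astype(int) == y)

# 1) Sample N random perturbations around the pretrained weights.
#    N, K, and sigma are hypothetical hyperparameters for this sketch.
N, K, sigma = 64, 8, 0.1
candidates = [w_pretrained + rng.normal(scale=sigma, size=8) for _ in range(N)]

# 2) Keep the top K candidates by task score.
top_k = sorted(candidates, key=accuracy, reverse=True)[:K]

# 3) Ensemble the top-K predictions by majority vote.
votes = np.stack([(X @ w > 0).astype(int) for w in top_k])
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print(f"pretrained accuracy: {accuracy(w_pretrained):.3f}")
print(f"ensemble accuracy:   {np.mean(ensemble_pred == y):.3f}")
```

Note that every candidate is scored independently, so the $N$ evaluations parallelize trivially, which is the practical appeal relative to sequential gradient-based fine-tuning.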


