Locai Labs, a UK-based AI firm, has thrown a wrench into the frontier model race, announcing the open-source release of Locai L1-Large. The company claims its new model, built on the Qwen3 235B Instruct architecture, has surpassed leading proprietary models—including GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Flash—on the critical Arena Hard v2 benchmark for conversational alignment and human preference.
The achievement is notable not just for the performance metrics, but for *how* Locai L1-Large was trained. Locai Labs developed a new post-training framework called "Forget-Me-Not," which allows the model to self-improve on downstream tasks without relying on expensive, human-labeled preference data.
Forget-Me-Not combines the concepts of experience replay (mixing in old data to prevent catastrophic forgetting) and self-improvement (where the model generates and grades its own training data). This process was targeted at aligning the model toward broad goals like helpfulness, conciseness, and factuality. The results appear to validate the methodology: Locai L1-Large showed a 2.1% improvement on Arena Hard v2 over the base Qwen model and delivered a 17% improvement on the AgentHarm safety benchmark, suggesting the self-judgment process effectively filtered out harmful outputs.
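Locai Labs has not published the Forget-Me-Not implementation details, but the two ingredients described above—self-generated, self-graded training data mixed with replayed older examples—can be sketched as a toy data-construction loop. Everything here (`generate_candidates`, `self_grade`, `build_batch`, the replay ratio) is a hypothetical illustration, not the company's actual code:

```python
import random

random.seed(0)  # deterministic replay sampling for the example

def generate_candidates(prompt, n=4):
    # Stand-in for model sampling: the real framework would draw
    # n responses from the model being trained.
    return [f"{prompt} [candidate {i}]" for i in range(n)]

def self_grade(prompt, response):
    # Stand-in for the model grading its own output against broad
    # goals (helpfulness, conciseness, factuality). Toy heuristic:
    # shorter responses score higher, mimicking a conciseness judge.
    return 1.0 / (1.0 + len(response))

def build_batch(prompts, replay_buffer, replay_ratio=0.5):
    """One round of self-improvement data generation with replay."""
    fresh = []
    for prompt in prompts:
        candidates = generate_candidates(prompt)
        # Keep only the best self-graded response per prompt;
        # low-scoring (potentially harmful or verbose) outputs
        # are filtered out here.
        best = max(candidates, key=lambda r: self_grade(prompt, r))
        fresh.append((prompt, best))
    # Mix in old examples so updates on new data do not erase
    # earlier capabilities (catastrophic forgetting).
    n_replay = round(len(fresh) * replay_ratio)
    replayed = random.sample(replay_buffer, min(n_replay, len(replay_buffer)))
    replay_buffer.extend(fresh)
    return fresh + replayed

buffer = [(f"old prompt {i}", f"old response {i}") for i in range(5)]
batch = build_batch(
    ["How do I sort a list?", "Explain DNS.", "What is RAID?", "Define TLS."],
    buffer,
)
```

With four new prompts and a 0.5 replay ratio, each batch contains four freshly self-graded pairs plus two replayed older pairs; the filtering step is where a real judge model would discard harmful outputs, which is the plausible mechanism behind the reported AgentHarm gains.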
