Key Takeaways
- Evaluating multilingual LLM guardrails requires context-specific benchmarks, as language and policy phrasing can significantly impact safety assessments.
- Mozilla.ai's `any-guardrail` framework was used to test how guardrails perform with non-English responses and policies in a humanitarian context.
- Results indicate guardrail performance varies by language and policy phrasing, highlighting the need for careful, multilingual evaluation.
Ensuring the safety and reliability of Large Language Models (LLMs) demands evaluations that are granular, accounting for context, language, task, and domain. As developers increasingly build custom performance benchmarks, they are also adopting context-aware guardrails – tools designed to validate or constrain LLM inputs and outputs based on customized safety policies. A critical question arises: do these guardrails, often powered by LLMs themselves, exhibit or even amplify the multilingual inconsistencies common in LLM responses?
To investigate this, a project combined insights from Mozilla Foundation Senior Fellow Roya Pakzad's Multilingual AI Safety Evaluations and Daniel Nissani's `any-guardrail` framework developed at Mozilla.ai. Pakzad's work provided a humanitarian use case, scenario design, and policy development, while Nissani's open-source `any-guardrail` offered a unified, customizable interface for technical execution. The framework allows organizations to manage risks in domain-specific LLM deployments by making the guardrail layer as configurable as the models themselves. This experiment sought to answer how guardrails behave with non-English LLM responses, whether the language of the policy impacts guardrail decisions, and the safety implications for humanitarian use cases.
Methodology
The evaluation tested three guardrails within the `any-guardrail` framework: FlowJudge, Glider, and AnyLLM (using GPT-5-nano). All three support custom policy classification and provide explanations for their judgments. FlowJudge and Glider utilize Likert scales (1-5 and 0-4 respectively) to score responses against user-defined metrics and criteria, while AnyLLM provides binary TRUE/FALSE classifications for policy adherence.
The study developed 60 contextually grounded scenarios, with 30 in English and 30 identical, human-audited Farsi translations. These scenarios mirror real-world questions asylum seekers might pose or that adjudication officers might use. They address complex topics like war, political repression, and sanctions, requiring domain-specific and contextual knowledge beyond mere linguistic competence. For instance, a scenario involving financial sanctions on Iran necessitates understanding country-specific regulations for humanitarian exemptions.
Guardrail policies were developed in both English and Farsi, drawing from evaluation criteria in the Multilingual Humanitarian Response Eval (MHRE) dataset. These criteria cover actionability, factual accuracy, safety, privacy, tone, non-discrimination, and freedom of access to information. Sample policy requirements emphasized awareness of asylum seekers' real-world conditions, accuracy regarding regional policy variations, disclaimers for sensitive topics, and avoidance of discriminatory implications.
The experiment involved submitting the 60 scenarios to three LLMs (Gemini 2.5 Flash, GPT-4o, Mistral Small). The generated responses were then evaluated by FlowJudge, Glider, and AnyLLM against both English and Farsi policies. A human annotator, fluent in Farsi and experienced in migration contexts, manually annotated responses to establish a baseline. Discrepancies were defined as an absolute difference of 2 or more between guardrail scores for Farsi and English responses or policies, indicating substantive shifts in safety classification.
Results
Quantitative analysis revealed variations in guardrail performance. FlowJudge scores were generally permissive, with slight deviations from human judgments. Glider, conversely, scored more strictly than human evaluators. For both FlowJudge and Glider, English policy evaluations tended to align more closely with human scores, suggesting better comprehension of English policy instructions.
The AnyLLM (GPT-5-nano) guardrail, using binary classification, showed inconsistencies. Its reasoning for TRUE/FALSE labels was often unclear, particularly with Farsi prompts and responses. In some cases, AnyLLM struggled to differentiate between input prompts and LLM outputs, weakening its judgments. For example, in a Farsi scenario concerning Croatian laws on unlocking phones during asylum processes, the guardrail flagged a response as non-adherent, citing evasion tactics in the input text itself rather than the LLM's output.



