Key Takeaways
- Effective AI safety requires context-aware guardrails tailored to specific languages, domains, and tasks.
- Multilingual LLMs can exhibit inconsistencies across languages, and this study investigates whether guardrails inherit or amplify these issues.
- Mozilla.ai's 'any-guardrail' framework was used to test context-aware guardrails on humanitarian scenarios in English and Farsi, revealing nuanced performance differences.
Developing robust AI safety measures means moving beyond one-size-fits-all solutions. As large language models (LLMs) become more deeply integrated into real-world applications, evaluation methods specific to context, language, task, and domain become critical. This is where context-aware guardrails come into play: tools designed to control or verify model inputs and outputs based on customized safety policies informed by specific contexts.
A significant challenge with LLMs is their multilingual inconsistency; models can provide different, lower-quality, or even contradictory information depending on the query language. This raises a crucial question: do guardrails, which are often powered by LLMs themselves, maintain their integrity across languages, or do they introduce their own biases and inconsistencies?
To tackle this, researchers combined two Mozilla projects: Roya Pakzad's Multilingual AI Safety Evaluations and Daniel Nissani's development of the open-source any-guardrail framework. This collaboration focused on a humanitarian use case, pairing Pakzad's expertise in scenario design and policy development with Nissani's technical implementation via any-guardrail. The framework offers a unified, customizable interface for various guardrail models, allowing organizations to manage risks in domain-specific AI deployments by making the guardrail layer as flexible as the AI models themselves. This research aimed to answer three key questions: How do guardrails perform with non-English LLM responses? Does the language of the policy text affect guardrail decisions? And what are the safety implications for humanitarian aid scenarios?
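To make the idea of a unified guardrail layer concrete, here is a minimal Python sketch of what such an interface can look like. The class and method names are illustrative assumptions for this post, not any-guardrail's actual API; the point is that calling code depends only on a shared contract, so guardrail backends can be swapped as freely as the underlying models.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class GuardrailResult:
    """Normalized verdict returned by any guardrail backend (illustrative)."""
    passed: bool         # did the response comply with the policy?
    score: float | None  # raw score, if the backend produces one
    explanation: str     # the guardrail's reasoning for its judgment


class Guardrail(Protocol):
    """Hypothetical unified interface: evaluate a response against a policy."""
    def evaluate(self, response: str, policy: str) -> GuardrailResult: ...


def check(guardrail: Guardrail, response: str, policy: str) -> GuardrailResult:
    # Depends only on the shared interface, so a FlowJudge-, Glider-, or
    # LLM-backed guardrail can be substituted without changing this code.
    return guardrail.evaluate(response, policy)
```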
Methodology
The study evaluated three guardrails within the any-guardrail framework: FlowJudge, Glider, and AnyLLM (GPT-5-nano). All are designed for custom policy classification and provide explanations for their judgments.
- FlowJudge: A customizable guardrail assessing responses against user-defined metrics and scoring rubrics on a 1-5 Likert scale (1 being non-compliant/harmful, 5 being compliant/safe).
- Glider: A rubric-based guardrail scoring responses on a 0-4 Likert scale (0 for non-compliant/unsafe, 4 for compliant/safe).
- AnyLLM (GPT-5-nano): A guardrail using a general LLM to check responses against custom policy text, providing binary classification (TRUE for adherence, FALSE for violation).
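Because the three guardrails report on different scales, comparing their judgments requires mapping each onto a common footing. A minimal sketch of one such normalization, where the mapping choices are mine rather than anything prescribed by the study:

```python
def normalize(guardrail: str, raw: float | bool) -> float:
    """Map each guardrail's raw output onto a shared 0.0-1.0 compliance scale
    (0.0 = non-compliant/unsafe, 1.0 = compliant/safe). Illustrative only."""
    if guardrail == "flowjudge":  # 1-5 Likert scale
        return (raw - 1) / 4
    if guardrail == "glider":     # 0-4 Likert scale
        return raw / 4
    if guardrail == "anyllm":     # binary TRUE/FALSE classification
        return 1.0 if raw else 0.0
    raise ValueError(f"unknown guardrail: {guardrail}")
```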
Sixty contextually grounded scenarios were developed: 30 in English and 30 matching, human-audited Farsi translations. These scenarios represent questions asylum seekers might ask or that adjudication officers might use. They probe beyond linguistic fluency to test domain-specific knowledge, including complex topics like war, political repression, and financial regulations, which is crucial for effective guardrail design in humanitarian contexts.

Guardrail policies were crafted in both English and Farsi, drawing from evaluation criteria in the Multilingual Humanitarian Response Eval (MHRE) dataset. These criteria span actionability, factual accuracy, safety, tone, non-discrimination, and freedom of information. Policy examples include requirements for awareness of real-world conditions facing asylum seekers, accuracy regarding policy variations, clear disclaimers for sensitive topics, and avoidance of discriminatory implications.
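As a rough illustration of how such criteria can become the policy text a guardrail consumes, the snippet below assembles a policy from these MHRE-style dimensions. The structure and wording are hypothetical paraphrases of the criteria described above, not the study's actual policies:

```python
# Hypothetical policy assembled from MHRE-style evaluation criteria.
# The wording paraphrases the dimensions described above; it is not
# the study's actual policy text.
CRITERIA = {
    "actionability": "Advice must reflect real-world conditions facing asylum seekers.",
    "factual_accuracy": "Claims must be accurate and note where policies vary by country.",
    "safety": "Sensitive topics require clear disclaimers and referral to qualified help.",
    "tone": "Responses must be respectful and appropriate to the context.",
    "non_discrimination": "Responses must avoid discriminatory implications.",
    "freedom_of_information": "Responses must not withhold information users are entitled to.",
}


def build_policy(language: str = "en") -> str:
    """Join the criteria into one policy string; the study used equivalent
    English and Farsi versions, but only English is sketched here."""
    header = f"Policy ({language}): a compliant response satisfies all of the following.\n"
    return header + "\n".join(f"- {name}: {text}" for name, text in CRITERIA.items())
```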
The experiment involved submitting each of the 60 scenarios to three LLMs (Gemini 2.5 Flash, GPT-4o, Mistral Small). The generated responses were then evaluated by FlowJudge, Glider, and AnyLLM (GPT-5-nano) against both the English and Farsi policies. To establish a baseline, a human annotator fluent in Farsi, with relevant domain experience, manually scored the responses using the same Likert scales. Discrepancies were flagged when the absolute difference between guardrail scores for Farsi and English responses, or between Farsi-policy and English-policy evaluations, was 2 or greater.
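The discrepancy rule itself is straightforward to express in code; here is a minimal sketch of the flagging logic just described, with names of my own choosing:

```python
DISCREPANCY_THRESHOLD = 2  # absolute score gap counted as a discrepancy


def is_discrepant(score_a: float, score_b: float) -> bool:
    """Flag a scenario when two scores of the same kind diverge by 2 or more
    points, e.g. Farsi vs. English responses, or Farsi- vs. English-policy
    evaluations of the same response."""
    return abs(score_a - score_b) >= DISCREPANCY_THRESHOLD
```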
Results
Analysis of the results revealed distinct performance patterns across the tested guardrails and languages. The full results are available on the project's GitHub repository.
Quantitative Analysis
Score Analysis
| Model/Prompt Language | Human Score | FlowJudge Score (English Policy) | FlowJudge Score (Farsi Policy) | Human–FlowJudge Score Difference¹ |
|---|---|---|---|---|
| Gemini-2.5-Flash / Farsi | 4.31 | 4.72 | 4.68 | -0.39 |
| Gemini-2.5-Flash / English | 4.53 | 4.83 | 4.96 | -0.36 |
| GPT-4o / Farsi | 3.66 | 4.56 | 4.63 | -0.93 |
| GPT-4o / English | 3.93 | 4.16 | 4.56 | -0.43 |
| Mistral Small / Farsi | 3.55 | 4.65 | 4.82 | -1.18 |
| Mistral Small / English | 4.10 | 4.20 | 4.86 | -0.43 |
| Model/Prompt Language | Human Score | Glider Score (English Policy) | Glider Score (Farsi Policy) | Human–Glider Score Difference¹ |
|---|---|---|---|---|
| Gemini-2.5-Flash / Farsi | 3.55 | 2 | 1.51 | 1.79 |
| Gemini-2.5-Flash / English | 3.62 | 2.62 | 2 | 1.31 |
| GPT-4o / Farsi | 2.93 | 1.3 | 1.43 | 1.56 |
| GPT-4o / English | 3.06 | 2.2 | 1.8 | 1.06 |
| Mistral Small / Farsi | 2.36 | 2.06 | 1.68 | 0.49 |
| Mistral Small / English | 3.3 | 2.46 | 1.26 | 1.44 |
¹ Human–FlowJudge/Glider Score Difference = average human score − mean(average guardrail score under the Farsi policy, average guardrail score under the English policy).
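For reference, the difference column in the two tables above can be reproduced with the footnote's definition. The example below checks it against the Gemini-2.5-Flash / Farsi row of the FlowJudge table:

```python
def human_guardrail_diff(human_avg: float, farsi_policy_avg: float,
                         english_policy_avg: float) -> float:
    """Average human score minus the mean of the guardrail's two per-policy averages."""
    return human_avg - (farsi_policy_avg + english_policy_avg) / 2


# Gemini-2.5-Flash / Farsi under FlowJudge: 4.31 - mean(4.68, 4.72) = -0.39
print(round(human_guardrail_diff(4.31, 4.68, 4.72), 2))  # -0.39
```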
| Model/Prompt Language | Human Score | AnyLLM (English Policy) | AnyLLM (Farsi Policy) | Human–AnyLLM Score Difference |
|---|---|---|---|---|
| Gemini-2.5-Flash / Farsi | 0/30 FALSE | 2/30 FALSE | 4/30 FALSE | +2 scenarios |
| Gemini-2.5-Flash / English | 0/30 FALSE | 3/30 FALSE | 1/30 FALSE | -2 scenarios |
| GPT-4o / Farsi | 3/30 FALSE | 2/30 FALSE | 4/30 FALSE | -1 scenario |
| GPT-4o / English | 2/30 FALSE | 4/30 FALSE | 3/30 FALSE | -3 scenarios |
| Mistral Small / Farsi | 2/30 FALSE | 7/30 FALSE | 5/30 FALSE | -4 scenarios |
| Mistral Small / English | 0/30 FALSE | 11/30 FALSE | 3/30 FALSE | -8 scenarios |
FlowJudge scores were generally more permissive than human judgments, with differences typically within 1 point. Glider, however, scored more strictly, often 1.0-1.5 points lower than human evaluations. When using English policy text, Glider scores aligned more closely with human scores across all models and response languages, suggesting better comprehension of English instructions.
While prompt language had minimal impact on FlowJudge's scores, this stability might reflect broad leniency rather than true cross-linguistic consistency. AnyLLM (GPT-5-nano), with its binary classification, offered less granular insight than the Likert-scale guardrails. Its reasoning for TRUE/FALSE judgments was often unclear, particularly with Farsi inputs, sometimes conflating the prompt with the model's output and producing weaker judgments. For instance, in a Farsi scenario concerning Croatian phone-unlocking laws during asylum processes, AnyLLM struggled to clearly delineate policy adherence.