
Conversation

@AhrendsW

Summary

Fixes #3111 - Eval fails for non-English languages

The ROUGE-1 evaluation was returning Match score: 0 for non-English languages (Thai, Chinese, Arabic, Japanese, Korean, etc.) even when the evaluated text matched perfectly. This was because the Porter stemmer (used by default) only works correctly for English text.

Changes:

  • Added _is_latin_script() helper function that uses Python's built-in unicodedata module to detect if text is primarily Latin script
  • Modified _calculate_rouge_1_scores() to only enable stemming for Latin script text
  • Added comprehensive unit tests for multiple languages

How it works:

  • For Latin script text (English, Portuguese, French, German, Spanish, etc.): Stemming is enabled (preserves current behavior)
  • For non-Latin script text (Thai, Chinese, Arabic, Japanese, Korean, etc.): Stemming is disabled (fixes the bug); a sketch of this detection logic follows below
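
A minimal sketch of what such a helper could look like, assuming a simple majority check over alphabetic characters via unicodedata; the threshold, handling of digit-only or punctuation-only text, and exact signature here are illustrative, not necessarily the PR's actual code:

```python
import unicodedata


def _is_latin_script(text: str) -> bool:
  """Return True if the alphabetic characters in `text` are mostly Latin.

  Illustrative sketch only; the PR's implementation may differ in its
  threshold and edge-case handling.
  """
  alpha_chars = [ch for ch in text if ch.isalpha()]
  if not alpha_chars:
    # Digits-only or punctuation-only text: default to Latin so the
    # existing (stemming) behavior is preserved.
    return True
  latin = sum(
      1 for ch in alpha_chars if unicodedata.name(ch, "").startswith("LATIN")
  )
  return latin / len(alpha_chars) > 0.5
```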

Test plan

  • Added 15 tests for _is_latin_script() function covering:

    • English, Portuguese, French, German (Latin scripts)
    • Thai, Chinese, Arabic, Japanese, Korean (non-Latin scripts)
    • Mixed scripts, numbers only, punctuation only
  • Added 22 tests for non-English ROUGE scoring covering:

    • Identical text matching (should return 1.0)
    • Partial overlap matching (should return > 0)
    • No overlap matching (should return 0)
    • English stemming verification (should still work)

Testing instructions:

pytest tests/unittests/evaluation/test_final_response_match_v1.py -v

Screenshots

The fix resolves the issue shown in #3111 where Thai text "สวัสดี" was returning score 0.

The ROUGE-1 evaluation was returning score 0 for non-English languages
(Thai, Chinese, Arabic, etc.) because the Porter stemmer only works
for English text.

This fix:
- Adds _is_latin_script() function to detect text script using unicodedata
- Disables stemmer for non-Latin scripts while preserving it for English
- Adds comprehensive tests for Thai, Chinese, Arabic, Japanese, Korean,
  Portuguese, French, German, and Spanish

Fixes google#3111
@gemini-code-assist
Contributor

Summary of Changes

Hello @AhrendsW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the ROUGE-1 evaluation system that previously prevented accurate scoring for non-English languages. By implementing a script detection mechanism, the system can now intelligently apply or disable language-specific stemming, ensuring that evaluation scores are reliable and meaningful across a diverse range of global languages. This enhancement significantly improves the utility of the evaluation metric for internationalized content.

Highlights

  • Internationalization Fix: The ROUGE-1 evaluation metric now correctly handles non-English languages by addressing an issue where the English-specific Porter stemmer caused incorrect 'Match score: 0' results for non-Latin scripts.
  • Script Detection Logic: A new helper function, _is_latin_script(), has been introduced. It uses Python's unicodedata module to determine if a given text is primarily composed of Latin script characters, allowing for intelligent conditional application of stemming.
  • Conditional Stemming: The _calculate_rouge_1_scores() function has been updated to only enable stemming when both the candidate and reference texts are detected as primarily Latin script. This preserves correct behavior for English and other Latin-based languages while disabling stemming for non-Latin scripts (e.g., Thai, Chinese, Arabic, Japanese, Korean) to ensure accurate scoring. A brief sketch follows this list.
  • Comprehensive Testing: Extensive unit tests have been added, including 15 tests for the _is_latin_script() function covering various script types and edge cases, and 22 tests for non-English ROUGE scoring across multiple languages (Thai, Chinese, Arabic, Japanese, Korean, Portuguese, French, German, Spanish), as well as English stemming verification and mixed-script scenarios.
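
For orientation, the conditional stemming described above might look roughly like the following; _is_latin_script and _calculate_rouge_1_scores are the names used in the PR, but the body below is an assumed sketch built on the public rouge_score API rather than the exact diff:

```python
from rouge_score import rouge_scorer


def _calculate_rouge_1_scores(candidate: str, reference: str):
  """Sketch: enable stemming only when both texts are primarily Latin script."""
  # The Porter stemmer is English-specific, so applying it to Thai, Chinese,
  # Arabic, Japanese, or Korean text breaks token matching.
  # Relies on the _is_latin_script helper sketched earlier in this thread.
  use_stemmer = _is_latin_script(candidate) and _is_latin_script(reference)
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
  return scorer.score(target=reference, prediction=candidate)["rouge1"]
```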

@adk-bot added the "eval [Component]: This issue is related to evaluation" label on Dec 16, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of incorrect ROUGE-1 scores for non-English languages by introducing a mechanism to conditionally disable stemming. The new _is_latin_script helper function is a reasonable approach to detect script type, and its application in _calculate_rouge_1_scores correctly disables the Porter stemmer for non-Latin text. The accompanying unit tests are very comprehensive and cover a wide range of languages and scenarios. I have a few suggestions to make the test assertions more precise, which will strengthen the test suite against future regressions. Overall, this is a solid fix.

Comment on lines +249 to +250
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but not very precise. We can calculate the exact expected F-measure to make this test stronger. Given the candidate and reference texts, the F-measure should be exactly 0.4.

Calculation:

  • Candidate tokens: 5
  • Reference tokens: 5
  • Common tokens: 2
  • Precision = 2/5 = 0.4
  • Recall = 2/5 = 0.4
  • F-measure = 2 * (0.4 * 0.4) / (0.4 + 0.4) = 0.4
Suggested change
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(0.4)

reference = "สวัสดี ค่ะ"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "สวัสดี" (1 out of 2 words)
assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)

medium

The assertion uses a relative tolerance rel=0.1, which is unnecessarily loose as the expected F-measure is exactly 0.5. Using pytest.approx(0.5) without a tolerance is more precise and makes the test stricter.

Suggested change
- assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)

Comment on lines +273 to +278
candidate = "今天 天气 很好" # "Today's weather is good"
reference = "今天 我 很 开心" # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The comment on line 276, Should match "今天" and "很", is inconsistent with the candidate string "今天 天气 很好". The default tokenizer will treat "很好" as a single token, so only "今天" will be matched.

To align with the comment's intent and create a stronger test, I suggest splitting "很好" into "很 好" in the candidate string. This will result in an F-measure of 0.5.

Suggested change
- candidate = "今天 天气 很好" # "Today's weather is good"
- reference = "今天 我 很 开心" # "Today I am happy"
- rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
- # Should match "今天" and "很"
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ candidate = "今天 天气 很 好" # "Today's weather is very good"
+ reference = "今天 我 很 开心" # "Today I am happy"
+ rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
+ # Should match "今天" and "很"
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)

reference = "今日 は 仕事 が 忙しい です" # "Today work is busy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今日", "は", "が", "です"
assert rouge_1_score.fmeasure > 0.5

medium

The assertion > 0.5 is correct but could be more precise. The expected F-measure for this case is exactly 2/3. Using pytest.approx(2 / 3) will make the test more robust against future changes.

Suggested change
- assert rouge_1_score.fmeasure > 0.5
+ assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

Comment on lines +337 to +338
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but are not very specific. The expected F-measure can be calculated precisely as 2/3 for this test case. Using a more precise assertion makes the test stronger.

Calculation:

  • Candidate tokens: 3 (오늘, 날씨가, 좋습니다)
  • Reference tokens: 3 (오늘, 기분이, 좋습니다)
  • Common tokens: 2 (오늘, 좋습니다)
  • Precision = 2/3, Recall = 2/3
  • F-measure = 2/3
Suggested change
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

@ryanaiagent self-assigned this on Dec 16, 2025
@ryanaiagent added the "request clarification [Status]: The maintainer needs clarification or more information from the author" label on Dec 17, 2025
@ryanaiagent
Collaborator

Hi @AhrendsW, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Can you fix the lint errors by running autoformat.sh?

The default rouge_scorer tokenizer only handles ASCII characters,
returning empty token lists for non-Latin scripts (Thai, Chinese,
Arabic, Japanese, Korean). This caused ROUGE scores of 0.0 even for
identical strings.

Changes:
- Add _UnicodeTokenizer class using Unicode-aware regex
- Use custom tokenizer for non-Latin scripts
- Fix import order per isort requirements
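
A rough idea of what the Unicode-aware tokenizer described above could look like; the regex, the class body, and the assumption that the installed rouge_score version accepts a custom tokenizer argument are illustrative, not necessarily the exact implementation in this PR:

```python
import re

from rouge_score import rouge_scorer


class _UnicodeTokenizer:
  """Illustrative sketch of a Unicode-aware tokenizer for RougeScorer."""

  def tokenize(self, text: str) -> list[str]:
    # In Python 3, \w matches word characters from any script, unlike the
    # library's default tokenizer, which keeps only ASCII [a-z0-9] tokens.
    return re.findall(r"\w+", text.lower())


# Assumes this rouge_score version exposes the `tokenizer` parameter.
scorer = rouge_scorer.RougeScorer(["rouge1"], tokenizer=_UnicodeTokenizer())
print(scorer.score("สวัสดี ครับ", "สวัสดี ครับ")["rouge1"].fmeasure)  # 1.0 for identical text
```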
@ryanaiagent added the "needs-review [Status]: The PR is awaiting review from the maintainer" label and removed the "request clarification [Status]" label on Dec 17, 2025
@ryanaiagent
Collaborator

Hi @seanzhou1023, can you please review this?
