
Conversation

@AhrendsW

Summary

Fixes #3111 - Eval fails for non-English languages

The ROUGE-1 evaluation was returning Match score: 0 for non-English languages (Thai, Chinese, Arabic, Japanese, Korean, etc.) even when the evaluated text matched perfectly. This was because the Porter stemmer (used by default) only works correctly for English text.

Changes:

  • Added _is_latin_script() helper function that uses Python's built-in unicodedata module to detect if text is primarily Latin script
  • Modified _calculate_rouge_1_scores() to only enable stemming for Latin script text
  • Added comprehensive unit tests for multiple languages

How it works:

  • For Latin script text (English, Portuguese, French, German, Spanish, etc.): Stemming is enabled (preserves current behavior)
  • For non-Latin script text (Thai, Chinese, Arabic, Japanese, Korean, etc.): Stemming is disabled (fixes the bug); a sketch of this detection logic follows below
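
A minimal sketch of what such a helper could look like, assuming a simple majority check over alphabetic characters via unicodedata; the threshold, handling of digit-only or punctuation-only text, and exact signature here are illustrative, not necessarily the PR's actual code:

```python
import unicodedata


def _is_latin_script(text: str) -> bool:
  """Return True if the alphabetic characters in `text` are mostly Latin.

  Illustrative sketch only; the PR's implementation may differ in its
  threshold and edge-case handling.
  """
  alpha_chars = [ch for ch in text if ch.isalpha()]
  if not alpha_chars:
    # Digits-only or punctuation-only text: default to Latin so the
    # existing (stemming) behavior is preserved.
    return True
  latin = sum(
      1 for ch in alpha_chars if unicodedata.name(ch, "").startswith("LATIN")
  )
  return latin / len(alpha_chars) > 0.5
```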

Test plan

  • Added 15 tests for _is_latin_script() function covering:

    • English, Portuguese, French, German (Latin scripts)
    • Thai, Chinese, Arabic, Japanese, Korean (non-Latin scripts)
    • Mixed scripts, numbers only, punctuation only
  • Added 22 tests for non-English ROUGE scoring covering:

    • Identical text matching (should return 1.0)
    • Partial overlap matching (should return > 0)
    • No overlap matching (should return 0)
    • English stemming verification (should still work)

Testing instructions:

pytest tests/unittests/evaluation/test_final_response_match_v1.py -v

Screenshots

The fix resolves the issue shown in #3111 where Thai text "สวัสดี" was returning score 0.

The ROUGE-1 evaluation was returning score 0 for non-English languages
(Thai, Chinese, Arabic, etc.) because the Porter stemmer only works
for English text.

This fix:
- Adds _is_latin_script() function to detect text script using unicodedata
- Disables stemmer for non-Latin scripts while preserving it for English
- Adds comprehensive tests for Thai, Chinese, Arabic, Japanese, Korean,
  Portuguese, French, German, and Spanish

Fixes google#3111
@gemini-code-assist
Contributor

Summary of Changes

Hello @AhrendsW, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the ROUGE-1 evaluation system that previously prevented accurate scoring for non-English languages. By implementing a script detection mechanism, the system can now intelligently apply or disable language-specific stemming, ensuring that evaluation scores are reliable and meaningful across a diverse range of global languages. This enhancement significantly improves the utility of the evaluation metric for internationalized content.

Highlights

  • Internationalization Fix: The ROUGE-1 evaluation metric now correctly handles non-English languages by addressing an issue where the English-specific Porter stemmer caused incorrect 'Match score: 0' results for non-Latin scripts.
  • Script Detection Logic: A new helper function, _is_latin_script(), has been introduced. It uses Python's unicodedata module to determine if a given text is primarily composed of Latin script characters, allowing for intelligent conditional application of stemming.
  • Conditional Stemming: The _calculate_rouge_1_scores() function has been updated to only enable stemming when both the candidate and reference texts are detected as primarily Latin script. This preserves correct behavior for English and other Latin-based languages while disabling stemming for non-Latin scripts (e.g., Thai, Chinese, Arabic, Japanese, Korean) to ensure accurate scoring. A brief sketch follows this list.
  • Comprehensive Testing: Extensive unit tests have been added, including 15 tests for the _is_latin_script() function covering various script types and edge cases, and 22 tests for non-English ROUGE scoring across multiple languages (Thai, Chinese, Arabic, Japanese, Korean, Portuguese, French, German, Spanish), as well as English stemming verification and mixed-script scenarios.
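
For orientation, the conditional stemming described above might look roughly like the following; _is_latin_script and _calculate_rouge_1_scores are the names used in the PR, but the body below is an assumed sketch built on the public rouge_score API rather than the exact diff:

```python
from rouge_score import rouge_scorer


def _calculate_rouge_1_scores(candidate: str, reference: str):
  """Sketch: enable stemming only when both texts are primarily Latin script."""
  # The Porter stemmer is English-specific, so applying it to Thai, Chinese,
  # Arabic, Japanese, or Korean text breaks token matching.
  # Relies on the _is_latin_script helper sketched earlier in this thread.
  use_stemmer = _is_latin_script(candidate) and _is_latin_script(reference)
  scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=use_stemmer)
  return scorer.score(target=reference, prediction=candidate)["rouge1"]
```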

@adk-bot added the "eval [Component]: This issue is related to evaluation" label on Dec 16, 2025
Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request effectively addresses the issue of incorrect ROUGE-1 scores for non-English languages by introducing a mechanism to conditionally disable stemming. The new _is_latin_script helper function is a reasonable approach to detect script type, and its application in _calculate_rouge_1_scores correctly disables the Porter stemmer for non-Latin text. The accompanying unit tests are very comprehensive and cover a wide range of languages and scenarios. I have a few suggestions to make the test assertions more precise, which will strengthen the test suite against future regressions. Overall, this is a solid fix.

Comment on lines +249 to +250
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but not very precise. We can calculate the exact expected F-measure to make this test stronger. Given the candidate and reference texts, the F-measure should be exactly 0.4.

Calculation:

  • Candidate tokens: 5
  • Reference tokens: 5
  • Common tokens: 2
  • Precision = 2/5 = 0.4
  • Recall = 2/5 = 0.4
  • F-measure = 2 * (0.4 * 0.4) / (0.4 + 0.4) = 0.4
Suggested change
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(0.4)

reference = "สวัสดี ค่ะ"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "สวัสดี" (1 out of 2 words)
assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)

medium

The assertion uses a relative tolerance rel=0.1, which is unnecessarily loose as the expected F-measure is exactly 0.5. Using pytest.approx(0.5) without a tolerance is more precise and makes the test stricter.

Suggested change
- assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)

Comment on lines +273 to +278
candidate = "今天 天气 很好" # "Today's weather is good"
reference = "今天 我 很 开心" # "Today I am happy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今天" and "很"
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The comment on line 276, Should match "今天" and "很", is inconsistent with the candidate string "今天 天气 很好". The default tokenizer will treat "很好" as a single token, so only "今天" will be matched.

To align with the comment's intent and create a stronger test, I suggest splitting "很好" into "很 好" in the candidate string. This will result in an F-measure of 0.5.

Suggested change
- candidate = "今天 天气 很好" # "Today's weather is good"
- reference = "今天 我 很 开心" # "Today I am happy"
- rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
- # Should match "今天" and "很"
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ candidate = "今天 天气 很 好" # "Today's weather is very good"
+ reference = "今天 我 很 开心" # "Today I am happy"
+ rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
+ # Should match "今天" and "很"
+ assert rouge_1_score.fmeasure == pytest.approx(0.5)

reference = "今日 は 仕事 が 忙しい です" # "Today work is busy"
rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
# Should match "今日", "は", "が", "です"
assert rouge_1_score.fmeasure > 0.5

medium

The assertion > 0.5 is correct but could be more precise. The expected F-measure for this case is exactly 2/3. Using pytest.approx(2 / 3) will make the test more robust against future changes.

Suggested change
- assert rouge_1_score.fmeasure > 0.5
+ assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

Comment on lines +337 to +338
assert rouge_1_score.fmeasure > 0
assert rouge_1_score.fmeasure < 1.0

medium

The assertions > 0 and < 1.0 are correct but are not very specific. The expected F-measure can be calculated precisely as 2/3 for this test case. Using a more precise assertion makes the test stronger.

Calculation:

  • Candidate tokens: 3 (오늘, 날씨가, 좋습니다)
  • Reference tokens: 3 (오늘, 기분이, 좋습니다)
  • Common tokens: 2 (오늘, 좋습니다)
  • Precision = 2/3, Recall = 2/3
  • F-measure = 2/3
Suggested change
- assert rouge_1_score.fmeasure > 0
- assert rouge_1_score.fmeasure < 1.0
+ assert rouge_1_score.fmeasure == pytest.approx(2 / 3)

@ryanaiagent self-assigned this on Dec 16, 2025
@ryanaiagent added the "request clarification [Status]: The maintainer needs clarification or more information from the author" label on Dec 17, 2025
@ryanaiagent
Collaborator

Hi @AhrendsW, thank you for your contribution! We appreciate you taking the time to submit this pull request.
Can you fix the lint errors by running autoformat.sh?

The default rouge_scorer tokenizer only handles ASCII characters,
returning empty token lists for non-Latin scripts (Thai, Chinese,
Arabic, Japanese, Korean). This caused ROUGE scores of 0.0 even for
identical strings.

Changes:
- Add _UnicodeTokenizer class using Unicode-aware regex
- Use custom tokenizer for non-Latin scripts
- Fix import order per isort requirements
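
A rough idea of what the Unicode-aware tokenizer described above could look like; the regex, the class body, and the assumption that the installed rouge_score version accepts a custom tokenizer argument are illustrative, not necessarily the exact implementation in this PR:

```python
import re

from rouge_score import rouge_scorer


class _UnicodeTokenizer:
  """Illustrative sketch of a Unicode-aware tokenizer for RougeScorer."""

  def tokenize(self, text: str) -> list[str]:
    # In Python 3, \w matches word characters from any script, unlike the
    # library's default tokenizer, which keeps only ASCII [a-z0-9] tokens.
    return re.findall(r"\w+", text.lower())


# Assumes this rouge_score version exposes the `tokenizer` parameter.
scorer = rouge_scorer.RougeScorer(["rouge1"], tokenizer=_UnicodeTokenizer())
print(scorer.score("สวัสดี ครับ", "สวัสดี ครับ")["rouge1"].fmeasure)  # 1.0 for identical text
```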
@ryanaiagent added the "needs-review [Status]: The PR is awaiting review from the maintainer" label and removed the "request clarification [Status]" label on Dec 17, 2025
@ryanaiagent
Collaborator

Hi @seanzhou1023, can you please review this?
