fix(eval): Support non-English languages in response_match_score #3923
Conversation
The ROUGE-1 evaluation was returning score 0 for non-English languages (Thai, Chinese, Arabic, etc.) because the Porter stemmer only works for English text. This fix:
- Adds an _is_latin_script() function to detect the text's script using unicodedata
- Disables the stemmer for non-Latin scripts while preserving it for English
- Adds comprehensive tests for Thai, Chinese, Arabic, Japanese, Korean, Portuguese, French, German, and Spanish

Fixes google#3111
Summary of Changes (Gemini Code Assist): This pull request resolves a bug in the ROUGE-1 evaluation system that previously prevented accurate scoring for non-English languages. By implementing a script detection mechanism, the system can now apply or disable language-specific stemming as appropriate, ensuring that evaluation scores are reliable and meaningful across a diverse range of languages. This improves the utility of the evaluation metric for internationalized content.
Code Review
This pull request effectively addresses the issue of incorrect ROUGE-1 scores for non-English languages by introducing a mechanism to conditionally disable stemming. The new _is_latin_script helper function is a reasonable approach to detect script type, and its application in _calculate_rouge_1_scores correctly disables the Porter stemmer for non-Latin text. The accompanying unit tests are very comprehensive and cover a wide range of languages and scenarios. I have a few suggestions to make the test assertions more precise, which will strengthen the test suite against future regressions. Overall, this is a solid fix.
    assert rouge_1_score.fmeasure > 0
    assert rouge_1_score.fmeasure < 1.0
The assertions > 0 and < 1.0 are correct but not very precise. We can calculate the exact expected F-measure to make this test stronger. Given the candidate and reference texts, the F-measure should be exactly 0.4.
Calculation:
- Candidate tokens: 5
- Reference tokens: 5
- Common tokens: 2
- Precision = 2/5 = 0.4
- Recall = 2/5 = 0.4
- F-measure = 2 * (0.4 * 0.4) / (0.4 + 0.4) = 0.4
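The reviewer's arithmetic can be double-checked with exact rational numbers (the variable names here are illustrative, mirroring the calculation above):

```python
from fractions import Fraction

# Mirror the calculation above with exact rationals.
precision = Fraction(2, 5)  # 2 common tokens out of 5 candidate tokens
recall = Fraction(2, 5)     # 2 common tokens out of 5 reference tokens
f_measure = 2 * (precision * recall) / (precision + recall)
assert f_measure == Fraction(2, 5)  # exactly 0.4
```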
Suggested change:
    - assert rouge_1_score.fmeasure > 0
    - assert rouge_1_score.fmeasure < 1.0
    + assert rouge_1_score.fmeasure == pytest.approx(0.4)
    reference = "สวัสดี ค่ะ"
    rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
    # Should match "สวัสดี" (1 out of 2 words)
    assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
The assertion uses a relative tolerance rel=0.1, which is unnecessarily loose as the expected F-measure is exactly 0.5. Using pytest.approx(0.5) without a tolerance is more precise and makes the test stricter.
Suggested change:
    - assert rouge_1_score.fmeasure == pytest.approx(0.5, rel=0.1)
    + assert rouge_1_score.fmeasure == pytest.approx(0.5)
    candidate = "今天 天气 很好"  # "Today's weather is good"
    reference = "今天 我 很 开心"  # "Today I am happy"
    rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
    # Should match "今天" and "很"
    assert rouge_1_score.fmeasure > 0
    assert rouge_1_score.fmeasure < 1.0
The comment on line 276, Should match "今天" and "很", is inconsistent with the candidate string "今天 天气 很好". The default tokenizer will treat "很好" as a single token, so only "今天" will be matched.
To align with the comment's intent and create a stronger test, I suggest splitting "很好" into "很 好" in the candidate string. This will result in an F-measure of 0.5.
Suggested change:
    - candidate = "今天 天气 很好"  # "Today's weather is good"
    + candidate = "今天 天气 很 好"  # "Today's weather is very good"
      reference = "今天 我 很 开心"  # "Today I am happy"
      rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
      # Should match "今天" and "很"
    - assert rouge_1_score.fmeasure > 0
    - assert rouge_1_score.fmeasure < 1.0
    + assert rouge_1_score.fmeasure == pytest.approx(0.5)
    reference = "今日 は 仕事 が 忙しい です"  # "Today work is busy"
    rouge_1_score = _calculate_rouge_1_scores(candidate, reference)
    # Should match "今日", "は", "が", "です"
    assert rouge_1_score.fmeasure > 0.5
    assert rouge_1_score.fmeasure > 0
    assert rouge_1_score.fmeasure < 1.0
The assertions > 0 and < 1.0 are correct but are not very specific. The expected F-measure can be calculated precisely as 2/3 for this test case. Using a more precise assertion makes the test stronger.
Calculation:
- Candidate tokens: 3 (오늘, 날씨가, 좋습니다)
- Reference tokens: 3 (오늘, 기분이, 좋습니다)
- Common tokens: 2 (오늘, 좋습니다)
- Precision = 2/3, Recall = 2/3
- F-measure = 2/3
Suggested change:
    - assert rouge_1_score.fmeasure > 0
    - assert rouge_1_score.fmeasure < 1.0
    + assert rouge_1_score.fmeasure == pytest.approx(2 / 3)
Hi @AhrendsW, thank you for your contribution! We appreciate you taking the time to submit this pull request.
The default rouge_scorer tokenizer only handles ASCII characters, returning empty token lists for non-Latin scripts (Thai, Chinese, Arabic, Japanese, Korean). This caused ROUGE scores of 0.0 even for identical strings. Changes:
- Add a _UnicodeTokenizer class using a Unicode-aware regex
- Use the custom tokenizer for non-Latin scripts
- Fix import order per isort requirements
Hi @seanzhou1023, can you please review this?
Summary
Fixes #3111 - Eval fails for non-English languages
The ROUGE-1 evaluation was returning "Match score: 0" for non-English languages (Thai, Chinese, Arabic, Japanese, Korean, etc.) even when the evaluated text matched perfectly. This was because the Porter stemmer (used by default) only works correctly for English text.

Changes:
- Added an _is_latin_script() helper function that uses Python's built-in unicodedata module to detect whether text is primarily Latin script
- Updated _calculate_rouge_1_scores() to enable stemming only for Latin-script text

How it works:
Test plan
Added 15 tests for the _is_latin_script() function and 22 tests for non-English ROUGE scoring.
Testing instructions:
Screenshots
The fix resolves the issue shown in #3111 where Thai text "สวัสดี" was returning score 0.
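The root cause can be reproduced in isolation. The `[a-z0-9]+` pattern below is a simplified approximation of rouge_score's default ASCII-only tokenizer, not the library's literal code:

```python
import re

text = "สวัสดี"  # Thai "hello", the string from issue #3111

# Approximation of the default tokenizer: ASCII alphanumeric runs only.
ascii_tokens = re.findall(r"[a-z0-9]+", text.lower())
print(ascii_tokens)  # -> [] : no tokens at all, so ROUGE-1 scores 0.0

# A Unicode-aware pattern preserves non-Latin text.
unicode_tokens = re.findall(r"\w+", text.lower())
print(unicode_tokens)
```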