@ben-schwen
Member

@ben-schwen ben-schwen commented Oct 25, 2025

Closes #7336
Closes #1343
Closes #7333

@codecov

codecov bot commented Oct 25, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.96%. Comparing base (291a711) to head (e11d36d).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7388      +/-   ##
==========================================
- Coverage   98.97%   98.96%   -0.02%     
==========================================
  Files          87       87              
  Lines       16733    16741       +8     
==========================================
+ Hits        16561    16567       +6     
- Misses        172      174       +2     

@github-actions

github-actions bot commented Oct 25, 2025

  • HEAD=tests_requires_utf8 stopped early for DT[by,verbose=TRUE] improved in #6296
  • HEAD=tests_requires_utf8 stopped early for isoweek improved in #7144
    Comparison Plot

Generated via commit e11d36d

Download link for the artifact containing the test results: ↓ atime-results.zip

Task | Duration
R setup and installing dependencies | 2 minutes and 54 seconds
Installing different package versions | 45 seconds
Running and plotting the test cases | 4 minutes and 26 seconds

@aitap
Member

aitap commented Oct 25, 2025

Sorry if I'm late to note this, but wouldn't a more reliable test for this be the same thing as we currently use for ñ in test 2266? A test may require some symbols (ñ, ü, ん) to be representable in the native encoding. The symbols may be represented using Unicode escapes (\uXXXX) as they currently do. If !identical(foo, enc2native(foo)), then the test must be skipped.

@ben-schwen
Member Author

> The symbols may be represented using Unicode escapes (\uXXXX) as they currently do. If !identical(foo, enc2native(foo)), then the test must be skipped.

Good point. I have integrated this for the utf8_check.
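
A minimal sketch of the check discussed in this exchange, assuming a helper that only compares a string with its native-encoding translation. The name `representable_natively` is hypothetical; it is not data.table's `utf8_check`, whose implementation is not shown in this thread.

```r
# Hypothetical helper: TRUE when every element of x survives
# translation to the native encoding unchanged.
representable_natively = function(x) {
  identical(x, enc2native(x))
}

# n-tilde, u-umlaut and hiragana "n", written as Unicode escapes so the
# source file itself stays pure ASCII.
needed = c("\u00F1", "\u00FC", "\u3093")

if (representable_natively(needed)) {
  message("native encoding can represent all symbols; run the test")
} else {
  message("skipping: symbols not representable in the native encoding")
}
```

In a UTF-8 locale the first branch should run; in a locale such as `C`, `enc2native()` substitutes sequences like `<U+00F1>`, the comparison fails, and the test would be skipped.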

@ben-schwen ben-schwen requested review from MichaelChirico and removed request for MichaelChirico December 28, 2025 15:03
x1 = c("al\u00E4", "ala", "\u00E4allc", "coep")
x2 = c("ala", "al\u00E4")
tstc = function(y) unlist(lapply(y, function(x) as.character(as.name(x))), use.names=FALSE)
test(1088.1, requires_utf8="\u00E4", chmatch(x1, x2), match(x1, x2)) # should not fallback to "match"
Member

maybe easier to understand/write tests as requires_utf8=c(x1, x2)?

also, maybe better as a local() check here too, to reduce the visual noise of all the identical requires_utf8= inputs?
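
One way to read the `local()` suggestion is sketched below: guard a whole group of checks once instead of repeating the same `requires_utf8=` input on every call. This is only an illustration under assumptions. It uses base `match()` and `stopifnot()` as stand-ins for `chmatch()` and `test()`, and the variable `native_ok` is hypothetical.

```r
# Strings the group of checks depends on, taken from the snippet under review.
x1 = c("al\u00E4", "ala", "\u00E4allc", "coep")
x2 = c("ala", "al\u00E4")

# Evaluate the encoding requirement once for the whole group.
native_ok = identical(c(x1, x2), enc2native(c(x1, x2)))

local(if (native_ok) {
  # Stand-in for the real tests: base match() instead of chmatch().
  stopifnot(identical(match(x1, x2), c(2L, 1L, NA, NA)))
} else {
  message("skipping group: strings not representable in the native encoding")
})
```

The trade-off is that skipped tests are no longer individually visible to the test harness, so the per-test argument may still be preferable where skip reporting matters.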

# for completeness, include test from #2528 of non-ASCII LHS of := (it could feasibly fail in future due to something other than chmatch)

local(if (utf8_check("\u00E4")) {
eval(parse(text='
Member

@MichaelChirico MichaelChirico Dec 29, 2025

why do we need eval(parse())?

also, parse(keep.source=FALSE) for micro-improvement

Member

@aitap aitap Dec 29, 2025

When the parser sees constructs like

data.table("\u00FCber" = c(1, 0, 0, 0, 0))

...it needs to construct a "language" (LANGSXP) call object where TAG(CADR(call)) is a symbol whose PRINTNAME is a CHARSXP saying über. That CHARSXP must be in the native encoding: requiring a single encoding makes it possible to compare pointers to SYMSXP values for equality (unlike CHARSXP where we have to test NEED2UTF8 and so on), and the native encoding was the default back before R had string encodings.

So when the parser tries to translate that Unicode string into the native encoding, it fails, emits a warning, and probably fails the following test, because the resulting string contains a substitution sequence:

LC_ALL=C R -q -s -e 'parse(text = r"{data.table("\u00FCber" = c(1, 0, 0, 0, 0))}")'
expression(data.table(`<U+00FC>ber` = c(1, 0, 0, 0, 0)))
Warning message:
In parse(text = "data.table(\"\\u00FCber\" = c(1, 0, 0, 0, 0))") :
  unable to translate '<U+00FC>ber' to native encoding

Without the runtime eval(parse(...)), this warning happens during source() with no way to avoid it.
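
The deferral described above can be sketched as follows, assuming a plain `identical()`/`enc2native()` guard in place of the PR's `utf8_check`. The code is kept as a string whose escape stays literal, so the parser only sees the non-ASCII name once the guard has passed. The variable names are hypothetical.

```r
# The doubled backslash keeps "\u00FCber" as literal text inside `code`;
# the escape is only interpreted when parse() runs below.
code = 'list("\\u00FCber" = c(1, 0, 0, 0, 0))'

if (identical("\u00FC", enc2native("\u00FC"))) {
  # Safe to parse: the argument name can become a native-encoded symbol.
  res = eval(parse(text = code, keep.source = FALSE))
  stopifnot(identical(names(res), "\u00FCber"))
} else {
  message("skipping: parsing would warn and produce a <U+00FC> substitution")
}
```

Writing the same call directly in the test file would make `source()` parse it unconditionally, which is the case the comment above says cannot be guarded.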

Member

it would be good to write that down somewhere as a reminder, but I'm not sure the best place to do it while being (1) discoverable and (2) not repetitive.

maybe (1) document it near requires_utf8 in R/test.data.table.R and (2) add a comment by each eval(parse()) like "see requires_utf8 description"?

eval(parse(text='
DT = data.table(pas = c(1:5, NA, 6:10), good = c(1:10, NA))
setnames(DT, "pas", "p\u00E4s")
test(1092, requires_utf8="\u00E4", eval(parse(text="DT[is.na(p\u00E4s), p\u00E4s := 99L]")), data.table("p\u00E4s" = c(1:5, 99L, 6:10), good = c(1:10,NA)))
Member

nested eval(parse())... gnarly

Development

Successfully merging this pull request may close these issues.

  • Add argument like 'requires_utf8' to test()
  • Escape UTF-8 dependent tests
  • Finish DtNonAsciiTests package and submit to CRAN
