-
Notifications
You must be signed in to change notification settings - Fork 3.2k
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
HTML API: Auto-escape JavaScript and JSON script tag contents when necessary #10635
Conversation
Test using WordPress PlaygroundThe changes in this pull request can previewed and tested using a WordPress Playground instance. WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser. Some things to be aware of
For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation. |
| * > then let the script block's type string for this script element be "text/javascript". | ||
| */ | ||
| $type_attr = $this->get_attribute( 'type' ); | ||
| $language_attr = $this->get_attribute( 'language' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this has the potential to be clearer if we did one of two things:
- early-abort when both the type and language attributes are missing.
- null-coalesce to some value like
''which would be semantically the same in these checks asnullbut allow us to treat the values as strings.
or even something like this…
$type = $this->get_attribute( 'type' );
$type = is_string( $type ) ?: ''; // combine `true` and `null` into ''.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this can be simpler. The different cases seem clear so we can also add some unit tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reviewed this and simplified or clarified it slightly, but I think it matches the specified behavior well and I'm not sure that further changes will improve things.
| /* | ||
| * > Otherwise, if the script block's type string is an ASCII case-insensitive match for | ||
| * > the string "module", then set el's type to "module". | ||
| * | ||
| * A module is evaluated as JavaScript. | ||
| */ | ||
| case 'module': | ||
| return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer this as-is, it allows quotes referencing different specifications to remain separate.
Co-authored-by: Weston Ruter <westonruter@gmail.com>
| * - \f | ||
| * - " " (U+0020 SPACE) | ||
| * - / | ||
| * - > |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we’re not checking for these state transitions. in the code following, we’re checking for the simplified transition, akin to the simplified diagram
I think it would be helpful to match the discussion of the approach with the comment. that is, I don’t know if it’s particularly helpful to spell out all of the characters when terser descriptions are warranted. e.g.
* If the plaintext contains any sequences which would could be interpreted as
* SCRIPT opening or closing tags, then it is sufficient to escape these. This
* prevents getting into the dangerous double-escape state. Technically, what
* matters is not the presence of a full or actual SCRIPT tag, but the start of
* a tag containing the "SCRIPT" tag name.
*
* @see URL
*/
if ( false !== stripos( ... ) || false !== stripos( ... ) ) {
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those lines capture all of the relevant state transitions, it's really about these minimal transitions that either close the script or move to double-escaped:
---
config:
---
stateDiagram-v2
state "script data" as ScriptData
state "escaped" as Escaped
state "double escaped" as DoubleEscaped
state "Close script" as CloseScript
Escaped --> DoubleEscaped : #60;script[ \t\f\r\n/>]
ScriptData --> CloseScript : #60;/script[ \t\f\r\n/>]
Escaped --> CloseScript : #60;/script[ \t\f\r\n/>]
The idea being that if those transitions are prevented then the script contents cannot break the HTML structure.
I agree the documentation here needs to be reviewed broadly. It's important to document the escaping well for posterity to understand why this is implemented as it is.
| * | ||
| * This escaping strategy strikes will make ALL JavaScript safe to embed in | ||
| * HTML in a way that is completely transparent in most cases. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are you happy with this comment? I think some things are very elaborate and technical, and worded in ways which could use some refinement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was one of the last things I added and could certainly use revision.
| * | ||
| * There are a few exceptions where the escaped form can be detected: | ||
| * | ||
| * - The escaped form would appear in any JavaScript comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
“There are a few exceptions where the escaped form can be detected:” is passive and feels indirect. Something more punching could go further with less.
* This escaping cannot be done everywhere in JavaScript:
*
* - Comments are not interpreted, meaning the escape sequences are visible, but
* only when reviewing the source code itself.
* - `String.raw()` and tagged template literal strings work on unescaped values.
* - The `source` property of a RegExp object returns unescaped strings.
*
* To avoid escaping in these situations it’s necessary to avoid presenting the
* text which appears like a SCRIPT tag, for example, by splitting it into two
* pieces and combining them.
*
* Example:
*
* console.log( String.raw`</\u0073cript>` ); // </\u0073cript>
* console.log( String.raw`</scr` + String.raw`ipt>` ); // </script>|
I've created #10668 to align STYLE tag handling with this approach. |
| * Other types of script tags cannot be escaped safely because there is | ||
| * no general escaping mechanism for arbitrary types of content. | ||
| */ | ||
| return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we think about changing the design of the
is_javascript_script_tag()methods, we could expose something like a MIME type and allow custom filtering of escapes to allow plugins to provide their own escaping routines rather than reject the updates.for now I see no reason to block progress for this, especially since we already rejected these contents before, but I think we can imagine a situation where someone can replace their SCRIPT contents and we can perform the
<scriptand</scriptchecks after the filter, and only then reject if nothing was able to escape
| return strtr( | ||
| $text, | ||
| array( '<' => '\\u003C' ) | ||
| ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linking to conversation about this escaping choice:
this is certainly easy, but I wonder what the relative value is compared to performing the same escape as with the JavaScript elements.
my guess is that
<scriptand</scriptoccur far less frequently within JSON than<does, so if we perform the same escaping then we might expect considerably fewer modifications to the JSON content.from a performance standpoint I expect the picture to be complicated and hard to quantify, as we’re comparing a relatively simple pair of string searches against string allocations and string building.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought about this as well. Ultimately, this seemed very reasonable, strtr is supposed to have good performance characteristics and there are no cases to worry about with <. This is close to the behavior of json_encode() with JSON_HEX_TAG, although that also escapes >.
Some other considerations:
- We could JSON decode/encode with the appropriate flags. That's likely expensive and can modify some types of values, I think it's best avoided.
- We could use the same escaping as JavaScript.
preg_replace_callbackis likely more expensive, but I don't have data on that. - I considered doing
strtrwith all the case variants, but the list ofstrtrcase variants would require something like 128 different cases for case-insensitive<scriptand</script, and that seems excessive.
There is a middle ground that is worth considering. This note on strtr() is relevant (bold mine):
If given two arguments, the second should be an array in the form
array('from' => 'to', ...)…The keys and the values may have any length… However, this function will be the most efficient when all the keys have the same size.
Every character added to the replacement becomes more targeted. We could use additional characters in the strtr pattern with something like the following:
function escape_json_script_contents_2( string $text ): string {
return strtr(
$text,
array(
'<s' => '\\u003Cs',
'<S' => '\\u003CS',
'</' => '\\u003C/',
)
);
}
function escape_json_script_contents_3( string $text ): string {
return strtr(
$text,
array(
'<sc' => '\\u003Csc',
'<Sc' => '\\u003CSc',
'<sC' => '\\u003CsC',
'<SC' => '\\u003CSC',
'</s' => '\\u003C/s',
'</S' => '\\u003C/S',
)
);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these seem better than how it stands now as I think the occurrence rate of <s, <S, and </ will also be much lower than of < by itself.
I wouldn’t want us to get caught up in the weeds of performance without any metrics to guide us. We can keep in mind that these operations are going to happen relatively infrequently during a page render. Maybe they happen ten times, vs. something like tag parsing which might happen thousands of times.
So the preg_replace_callback() seems just fine to me, even though personally I think both of these methods are probably non-ideal. I still prefer inching forward on stripos() followed by strspn().
The thing is, without data we just don’t know and so who cares if we are potentially slower on a minor part of the page render. If our intention is clear we can follow-up with improvements where we can make them soundly.
Ultimately here my intuition says that the difference between the basic strtr() and the stripos()+strspn() replacements is going to fall down to the relative improvement strtr() brings against the higher rate of modifying the string. I would expect < to appear in a reasonably high fraction of strings, whereas <script and variants are unlikely to regularly appear. Both approaches are scanning through the string to perform a search, and while strtr() is surely efficient, not allocating and not performing replacement is surely more efficient.
I would caution here against piling up verbose source code to eke out minor performance improvements when we don’t know if they matter and when they obscure intent. None of your suggestions obscure intent, but I think they may also not be essential.
| * | ||
| * † = Case insensitive 'script' followed by one of ' \t\f\r\n/>' | ||
| * | ||
| * The original source of this graph is included at the bottom of this file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs to be updated.
| } | ||
|
|
||
| /** | ||
| * Escape JavaScript script tag contents. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We'll need to document that this is suitable for JavaScript and JSON escaping.
| return 'json'; | ||
|
|
||
| /** @todo Rely on a full MIME parser for determining JSON content. */ | ||
| case 'application/json': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also good to include here initially:
| case 'application/json': | |
| case 'application/json': | |
| case 'application/ld+json': | |
| case 'application/manifest+json': |
These and other such MIME types would be handled by the parser.

The HTML API currently rejects script tag contents that may be dangerous. This is a proposal to detect JavaScript and JSON script tags and automatically escape contents when necessary.
<scriptor</script(case-insensitive) is found.In JSON
<is replaced with\u003C. This eliminates the problematic strings and aligns with the approach described in #63851 and applied in r60681.This is proposed as a simple character replacement with
strtr. This should be highly performant. A less invasive replacement could be done to only replace<in<scriptor</scriptwhere it's really necessary. This would preserve more of the JSON string, but likely at the cost of performance. It would require either a regular expression with case-insensitive matching (see JavaScript example).In JavaScript
<scriptand</script(followed by a necessary tag termination character\t\n\r\f/>) thesis replaced with its Unicode escape. This should remain valid in all contexts where the text can appear and maintain identical behavior in all except a few edge cases (see ticket or quoted section below for full explanation and caveats).From the ticket:
Trac ticket: https://core.trac.wordpress.org/ticket/64419
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.