11:15
<annevk>
Sigh, GitHub moderation is the worst. When you get spam on a PR you cannot delete the comment. If you block the person before editing the comment to remove the spam, you cannot edit the comment until you first unblock the person. And reporting the comment still takes you through a needlessly long software wizard I'm no longer interested in using.
11:21
<Luca Casonato>
I wish there was a feature like in Discord, where when you block someone it asks you whether all the comments that person posted in the last 1hr/24hrs/7d should be deleted/hidden. That'd be really useful
11:40
<annevk>
Matrix/Element does that too
14:19
<Timo Tijhof>
We are reasonably confident in the "reference implementation" in jsdom/whatwg-url, which says that such URLs are parsed like browsers do. https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvU3BlY2lhbDpCbGFua3BhZ2U/ZnJvbT1lbiZ0bz1lcyZjYW1wYWlnbj1hcnRpY2xlLXJlY29tbWVuZGF0aW9uJnBhZ2U9QXBvbGxvJTk2U295dXpfVGVzdF9Qcm9qZWN0&base=YWJvdXQ6Ymxhbms=

Thanks, that helps. With this, I traced it down to https://github.com/jsdom/whatwg-url/blob/a4cb13309246ca9ecf03404fdbf0d23ecaf114dd/lib/urlencoded.js#L10-L33 and https://github.com/jsdom/whatwg-url/blob/a4cb13309246ca9ecf03404fdbf0d23ecaf114dd/lib/percent-encoding.js#L20 which indeed does the critically important thing of ignoring things outside the known valid range and leaving them intact, whereas ECMAScript's decodeURIComponent throws an error for anything unexpected after a % sign.
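
Concretely, the difference (quick console sketch, not from either implementation):

	// ECMAScript's decodeURIComponent throws on anything malformed after '%':
	decodeURIComponent('100%');  // URIError
	decodeURIComponent('%zz');   // URIError
	// The spec's percent-decode instead copies a '%' that isn't followed by
	// two hex digits through unchanged, so '100%' stays '100%' and only
	// well-formed sequences like '%20' are decoded.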

I did actually (briefly) consider using jsdom originally, but found that it (understandably) makes heavy use of other internals. It goes down to the level of code points and then pulls in all of TextEncoder as well, and Uint8Array, etc. I struggled to find a minimal way to port that over.

I managed to make the currently-failing test cases pass by replacing the blanket decodeURIComponent call with a loop over the string: if the two chars after % are in-range, call decodeURIComponent on just that chunk. But then I found that this doesn't work for multi-byte characters, e.g. decoding %7F%C3%BF is only accepted when done all at once, not chunk by chunk. The URL spec doesn't say that, but then again the URL spec doesn't say to use decodeURIComponent at all; it says to work on bytes and code points and not return to UTF-8 strings until all the way done, which I don't think I can do within ES5, or at least not without a ton of other emulation code.
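
Concretely (quick console check):

	// The whole sequence decodes fine in one go:
	decodeURIComponent('%7F%C3%BF');  // '\x7F' + 'ÿ'
	// But chunk by chunk it fails, because '%C3' on its own is an incomplete
	// UTF-8 sequence:
	decodeURIComponent('%7F');  // OK
	decodeURIComponent('%C3');  // URIError
	decodeURIComponent('%BF');  // URIError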

I did try forcing it by using `String.fromCharCode(parseInt(chunk.slice(1), 16))`, but that just produced garbage.
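
(In hindsight the garbage makes sense: that maps each byte to a code unit directly, so multi-byte UTF-8 comes out as mojibake. For example:)

	// '%C3%BF' is UTF-8 for 'ÿ' (U+00FF), but decoded byte by byte:
	String.fromCharCode(parseInt('C3', 16));  // 'Ã' (U+00C3)
	String.fromCharCode(parseInt('BF', 16));  // '¿' (U+00BF)
	// i.e. 'Ã¿' instead of 'ÿ'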

I hate myself for this, but I managed to get all tests passing, both old and new cases, by using a regex. That seemed simpler than trying to eagerly buffer up multiple %-chunks or re-implementing lower-level code point mapping for multi-byte sequences.

The regex looks only for valid in-range percent-encoded chunks and repeats eagerly, so that we always pass as many of them to decodeURIComponent at once as possible.

	function percent_decode(bytes) {
		return bytes.replace(/((%[0-9A-Fa-f]{2})*)/g, function(m, p1) {
			return decodeURIComponent(p1);
		});
	}
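
Some quick checks of what that does (my own console session, not an exhaustive test suite):

	percent_decode('a%20b');      // 'a b'
	percent_decode('%7F%C3%BF');  // '\x7F' + 'ÿ' (multi-byte works because the
	                              // whole run reaches decodeURIComponent at once)
	percent_decode('100%');       // '100%' (a lone '%' is left alone)
	percent_decode('%FF');        // still throws, but the old blanket
	                              // decodeURIComponent call did too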

I feel dirty now, but it seems to hold up in my testing so far. Curious if anyone here knows of a better minimal way to get the multi-byte code points to work in ES5 without "a lot of code" and/or can poke holes in the above, preferably holes that didn't already exist in the previous version of the FT polyfill (which simply calls decodeURIComponent on the whole thing).

16:57
<annevk>
Ms2ger 💉💉: why is it synthetic settings objects/realms and not shadow settings objects/realms?
17:01
<annevk>
Ms2ger 💉💉: I'd also still like some kind of answer to the question I raised in my overall review comment
17:03
<annevk>
I'll leave these questions on the PR.
17:30
<annevk>
Timo Tijhof: that was a fun read. I'm bad at reading regexp, maybe add a comment, but that looks reasonable. It does look like it will fall over if folks are not using UTF-8 bytes, which is allowed. But perhaps that's acceptable in your case. Might also want to denote that.
17:51
<Timo Tijhof>

Thanks, that gives me a bit more confidence. The use case that led me to discover this bug in the FT polyfill is actually very much related to bytes that are not UTF-8 bytes.

Wikipedia has used UTF-8 for canonical URLs everywhere for almost two decades now. But some of the earliest language editions (specifically en.wikipedia, pl.wikipedia, and zh.wikipedia) used to use a different encoding, which our backend web server (MediaWiki PHP) continues to support to ensure URLs don't break. If the decoded string isn't valid UTF-8, those wikis fall back to a legacy encoding: for en.wikipedia that is windows-1252, for pl.wikipedia iso-8859-2, and for zh.wikipedia it's windows-936 for zh-hans and windows-950 for zh-hant. As per https://codesearch.wmcloud.org/core/?q=fallback8bitEncoding%20%3D
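
Sketched in JS with TextDecoder just to illustrate what the backend does (the real logic is in MediaWiki PHP; fallback8bitEncoding is the actual config name, the function name and signature here are made up):

	// Try strict UTF-8 first; if the bytes aren't valid UTF-8, fall back to
	// the wiki's configured legacy encoding (e.g. windows-1252 on en.wikipedia).
	function decodeLegacyTitle(bytes, fallbackEncoding) {
		// bytes: Uint8Array of the percent-decoded title
		try {
			return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
		} catch (e) {
			return new TextDecoder(fallbackEncoding).decode(bytes);
		}
	}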

For example, this URL is still supported: https://en.wikipedia.org/w/index.php?title=Apollo%96Soyuz_Test_Project

While we don't have any need to read the "correct" value of the title parameter in JavaScript, we do have many other query params we sometimes read client-side, and so having new mw.Uri (our in-house library) or new URL (polyfilled) throw unconditionally on those pages is a problem. It's fine if we can't read searchParams.get('title') correctly, so long as we can still do other things with it.

Our downstream task (which was meant to be closed by using the URL API) - https://phabricator.wikimedia.org/T106244

My PR (with docs now): https://github.com/Financial-Times/polyfill-library/pull/1173.

I'm still looking into improving the handling of incomplete multi-byte sequences. We are unlikely to encounter those in any "valid" URLs, not even legacy ones, but it still seems better not to crash on them and instead do a replacement like the standard does.

17:54
<annevk>
Timo Tijhof: so shouldn't you handle failure in that case and just return the input?
17:54
<annevk>
Timo Tijhof: and maybe rename to utf8_percent_decode
17:54
<Timo Tijhof>
Ack, yeah, I'm aware that calling these "bytes" is wrong (though that's the pre-existing name in the code).
17:55
<annevk>
Timo Tijhof: e.g., consider %FF which can never decode as UTF-8
17:57
<Timo Tijhof>
Timo Tijhof: and maybe rename to utf8_percent_decode
Hm.. so the issue there is that if you have %FF%20, my eager regex will feed both to decodeURIComponent at once (as it should for multi-byte) and then it still fails in its entirety. Returning that as a literal or as a replacement character would both differ from the spec. But I don't mind a bit of difference with the spec. This is meant to be a relatively loose polyfill, I think, and not a full virtual machine :)
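
i.e. (quick check):

	decodeURIComponent('%FF%20');  // URIError: %FF can never be valid UTF-8,
	                               // so the whole eagerly-matched run fails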
17:59
<annevk>
Yeah, I mainly meant that you don't seem to handle failure at all at the moment, other than rethrowing. You're right that sequences like that make it trickier.
18:00
<Timo Tijhof>
Ack yeah, I'll add a localised try-catch if nothing else. Good point.
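
Something like this, probably (untested sketch; on failure it keeps the raw chunk rather than substituting a replacement character, so it still differs a bit from the spec):

	function percent_decode(bytes) {
		return bytes.replace(/((%[0-9A-Fa-f]{2})*)/g, function(m, p1) {
			try {
				return decodeURIComponent(p1);
			} catch (e) {
				// e.g. '%FF' can never decode as UTF-8; keep the raw chunk
				// instead of throwing for the whole URL.
				return p1;
			}
		});
	}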