11:15 | <annevk> | Sigh, GitHub moderation is the worst. When you get spam on a PR you cannot delete the comment. If you block the person before editing the comment to remove the spam, you cannot edit the comment until you first unblock the person. And reporting the comment still takes you through a needlessly long software wizard I'm no longer interested in using. |
11:21 | <Luca Casonato> | I wish there was a feature like in Discord, where when you block someone it asks you whether all the comments that person posted in the last 1hr/24hrs/7d should be deleted/hidden. That'd be really useful |
11:40 | <annevk> | Matrix/Element does that too |
14:19 | <Timo Tijhof> | We are reasonably confident in the "reference implementation" in jsdom/whatwg-url, which shows that such URLs are parsed the way browsers parse them. https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvU3BlY2lhbDpCbGFua3BhZ2U/ZnJvbT1lbiZ0bz1lcyZjYW1wYWlnbj1hcnRpY2xlLXJlY29tbWVuZGF0aW9uJnBhZ2U9QXBvbGxvJTk2U295dXpfVGVzdF9Qcm9qZWN0&base=YWJvdXQ6Ymxhbms=
Thanks, that helps. With this, I traced it down to https://github.com/jsdom/whatwg-url/blob/a4cb13309246ca9ecf03404fdbf0d23ecaf114dd/lib/urlencoded.js#L10-L33 and https://github.com/jsdom/whatwg-url/blob/a4cb13309246ca9ecf03404fdbf0d23ecaf114dd/lib/percent-encoding.js#L20, which indeed do the critically important thing of ignoring anything outside the known valid range and leaving it intact, whereas ECMAScript's decodeURIComponent throws an error for anything unexpected after a % sign.
I did actually (briefly) consider using jsdom originally, but found that it (understandably) makes maximum use of other internals: it goes down to the level of code points and then brings in all of TextEncoder, Uint8Array, etc. I struggled to find a minimal way to port that over.
I managed to make the currently-failing test cases pass by replacing the blanket decodeURIComponent call with a loop over the string that, whenever the two chars after a % are in range, calls decodeURIComponent on just that chunk. But I then found that this doesn't work for multi-byte characters. I did try forcing it with `String.fromCharCode(parseInt(chunk.slice(1), 16));`, but that just produced garbage.
I hate myself for this, but I managed to get all tests passing, both old and new cases, by using a regex. That seemed simpler than trying to eagerly buffer up multiple %-chunks or re-implementing lower-level code point mapping for multi-byte sequences. The regex looks only for valid in-range percent-encoded chunks and greedily repeats, so that we always pass decodeURIComponent as many of them at once as possible.
I feel dirty now, but it seems to hold up in my testing so far. Curious whether anyone here knows of a better minimal way to get the multi-byte code points to work in ES5 without "a lot of code", and/or can poke holes in the above, preferably holes that didn't already exist in the previous version of the FT polyfill (which simply calls decodeURIComponent on the whole thing). |
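For readers following along, here is a minimal sketch of the kind of lenient decoding described above. It is not the FT polyfill's actual code, and percentDecodeLenient is an illustrative name: runs of %XX chunks are handed to decodeURIComponent together so that multi-byte UTF-8 sequences decode correctly, and runs that are not valid UTF-8 are left untouched instead of throwing.

```js
// Minimal sketch, not the polyfill's actual code; percentDecodeLenient is an
// illustrative name. Runs of %XX chunks are decoded together so that
// multi-byte UTF-8 sequences (e.g. %E2%82%AC) work, and runs that are not
// valid UTF-8 (e.g. %96) are left as-is instead of throwing.
function percentDecodeLenient(input) {
  return input.replace(/(%[0-9A-Fa-f]{2})+/g, function (run) {
    try {
      return decodeURIComponent(run);
    } catch (e) {
      // Not decodable as UTF-8: leave the whole run percent-encoded.
      return run;
    }
  });
}

// percentDecodeLenient('a%20b');                       // 'a b'
// percentDecodeLenient('%E2%82%AC');                   // '€'
// percentDecodeLenient('Apollo%96Soyuz_Test_Project'); // left intact, no throw
```

One caveat of this simplification: a run that mixes valid and invalid sequences (e.g. %20%96) is left entirely encoded, whereas a regex that only matches valid UTF-8 sequences, as described above, would still decode the valid parts.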
16:57 | <annevk> | Ms2ger 💉💉: why is it synthetic settings objects/realms and not shadow settings objects/realms? |
17:01 | <annevk> | Ms2ger 💉💉: I'd also still like some kind of answer to the question I raised in my overall review comment |
17:03 | <annevk> | I'll leave these questions on the PR. |
17:30 | <annevk> | Timo Tijhof: that was a fun read. I'm bad at reading regexp, maybe add a comment, but that looks reasonable. It does look like it will fall over if folks are not using UTF-8 bytes, which is allowed. But perhaps that's acceptable in your case. Might also want to denote that. |
17:51 | <Timo Tijhof> | Thanks, that gives me a bit more confidence. The use case that led me to discover this bug in the FT polyfill is in fact very much about bytes that are not UTF-8. Wikipedia has used UTF-8 for canonical URLs everywhere for almost two decades now, but some of the earliest language editions (specifically en.wikipedia, pl.wikipedia, and zh.wikipedia) used to use a different encoding, which our backend web server (MediaWiki PHP) continues to support so that old URLs don't break. On those wikis, if the decoded string isn't valid UTF-8, it falls back to a legacy encoding: windows-1252 for en.wikipedia, iso-8859-2 for pl.wikipedia, and on zh.wikipedia windows-936 for zh-hans and windows-950 for zh-hant, as per https://codesearch.wmcloud.org/core/?q=fallback8bitEncoding%20%3D
For example, this URL is still supported: https://en.wikipedia.org/w/index.php?title=Apollo%96Soyuz_Test_Project
While we don't have any need to read the "correct" value of the title parameter in JavaScript, we do have many other query params that we sometimes read client-side, and so having the URL polyfill not throw on URLs like these matters to us.
Our downstream task (which was meant to be closed by using the URL API): https://phabricator.wikimedia.org/T106244
My PR (with docs now): https://github.com/Financial-Times/polyfill-library/pull/1173
I'm still looking into improving the handling of incomplete multi-byte sequences. We are unlikely to encounter those in any "valid" URLs, not even legacy ones, but it still seems better not to crash on them and instead emit a replacement character like the standard does. |
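As a conceptual illustration of the fallback behaviour described above (this is not MediaWiki's actual PHP implementation; decodeTitle and its signature are made up for the example, and it assumes a runtime with TextDecoder): percent-decode to raw bytes, try a strict UTF-8 decode, and fall back to the wiki's legacy encoding if that fails.

```js
// Conceptual sketch only; MediaWiki does this server-side in PHP. The
// function name and signature are illustrative. Assumes TextDecoder support.
function decodeTitle(encoded, legacyEncoding) {
  // Percent-decode into raw bytes (the rest of the title is plain ASCII).
  var bytes = [];
  for (var i = 0; i < encoded.length; i++) {
    if (encoded.charAt(i) === '%' && /^[0-9A-Fa-f]{2}$/.test(encoded.substr(i + 1, 2))) {
      bytes.push(parseInt(encoded.substr(i + 1, 2), 16));
      i += 2;
    } else {
      bytes.push(encoded.charCodeAt(i));
    }
  }
  var buf = new Uint8Array(bytes);
  try {
    // Strict UTF-8 decode: throws on invalid sequences such as 0x96.
    return new TextDecoder('utf-8', { fatal: true }).decode(buf);
  } catch (e) {
    // Fall back to the wiki's legacy encoding, e.g. windows-1252 on en.wikipedia.
    return new TextDecoder(legacyEncoding).decode(buf);
  }
}

// decodeTitle('Apollo%96Soyuz_Test_Project', 'windows-1252')
//   -> 'Apollo–Soyuz_Test_Project' (0x96 decodes to an en dash, U+2013, in windows-1252)
```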
17:54 | <annevk> | Timo Tijhof: so shouldn't you handle failure in that case and just return the input? |
17:54 | <annevk> | Timo Tijhof: and maybe rename to utf8_percent_decode |
17:54 | <Timo Tijhof> | Ack, yeah, I'm aware that calling these "bytes" is wrong (though that's the pre-existing name in the code). |
17:55 | <annevk> | Timo Tijhof: e.g., consider %FF which can never decode as UTF-8 |
17:57 | <Timo Tijhof> | Timo Tijhof: and maybe rename to utf8_percent_decode |
17:59 | <annevk> | Yeah I mainly meant that you don't seem to handle failure at all at the moment. Other than rethrowing. You're right that sequences such as that make it trickier. |
18:00 | <Timo Tijhof> | Ack yeah, I'll add a localised try-catch if nothing else. Good point. |
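For comparison, the standard behaviour referred to above is what a native implementation does with bytes that can never be valid UTF-8: they decode to U+FFFD replacement characters rather than throwing. For example, in a browser or Node.js with native URLSearchParams:

```js
// Native URL Standard behaviour: invalid bytes become U+FFFD instead of throwing.
var params = new URLSearchParams('a=%FF&b=%E2%82%AC');
params.get('a'); // '\uFFFD' (replacement character)
params.get('b'); // '€'
```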