02:21
<rbuckton>

I just ran across a strange case while writing additional tests for RegExp Modifiers. I've found exactly two cases where /\b/u and /\b/ui disagree for the same character:

  • U+017f - ſ LATIN SMALL LETTER LONG S
  • U+212a - K KELVIN SIGN

A quick test of the same patterns and inputs in C# shows no disagreement, so it's not clear to me if this is expected or possibly a bug in \b.
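
A minimal sketch for reproducing the observation above, assuming a JavaScript engine that implements the `u` and `i` flags as specified. The names `inputs` and `boundaryResults` are illustrative and do not come from the test suite under discussion:

```javascript
// The two inputs reported above, written as escapes to avoid look-alike glyphs.
const inputs = ["\u017f", "\u212a"];

// For each input, record whether a word boundary is found with and without the i flag.
const boundaryResults = inputs.map((ch) => ({
  codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
  u: /\b/u.test(ch),
  ui: /\b/ui.test(ch),
}));

for (const r of boundaryResults) {
  console.log(`${r.codePoint}: /\\b/u -> ${r.u}, /\\b/ui -> ${r.ui}`);
}
```

In a conforming engine, both lines should report `false` for `/\b/u` and `true` for `/\b/ui`.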

02:31
<rbuckton>
possibly having to do with how Unicode case folding for those characters produces an ASCII character. It just seems strange to have something that is not considered a word character when preserving case, but is considered a word character when ignoring case.
02:39
<bakkot>
the original sin here is that \b and \w are not unicode-aware even in u mode
02:40
<bakkot>
this behavior follows immediately from that: U+017f is not an ascii word character, but it case-folds to s, which is, and i means that the regex operates on case-folded characters
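
The explanation above can be checked directly against `\w`, since `\b` is defined in terms of the same word-character set. This is a sketch under the assumption of a spec-conforming engine; `isWordChar` is a hypothetical helper written for this example, not an existing API:

```javascript
const longS = "\u017f"; // LATIN SMALL LETTER LONG S

// Hypothetical helper for this sketch: test \w against one character under the given flags.
const isWordChar = (ch, flags) => new RegExp("\\w", flags).test(ch);

const wordCharResults = {
  u: isWordChar(longS, "u"),     // no case folding: only ASCII word characters qualify
  ui: isWordChar(longS, "ui"),   // case folding applies: U+017F folds to "s"
  foldsToS: /^s$/ui.test(longS), // the same folding lets a literal "s" match U+017F
};

console.log(wordCharResults);
```

With `u` alone the character is rejected; adding `i` makes it a word character for the same reason that `/^s$/ui` matches it.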
02:40
<bakkot>
the decision to make \b and \w not unicode-aware predates me, unfortunately, so I cannot tell you why this is. it does seem... bad.
02:40
<bakkot>
(\d too but that one matters a lot less.)
03:19
<Justin Ridgewell>
Time to introduce a w flag for very very unicode mode?
03:27
<bakkot>
we actually did specifically discuss and reject the possibility of making \b etc unicode-aware in v-mode https://github.com/tc39/notes/blob/2fccc7f7a38201354a007394ab867ec7b245b464/meetings/2021-08/aug-31.md#regexp-set-notation--properties-of-strings
04:59
<Justin Ridgewell>

JRL: Also voicing support, I would not change these shorthands.

I do not remember this

05:31
<rbuckton>
I think waldemar's concern at the time was that changing \b, \w, and \d shouldn't be tied to the mode that adds set notation. We'd need to opt in either with a new mode or a {u} suffix. Either is fine so long as the new mode could be included in the modifiers list, i.e., \b{u} or (?w:\b) (or whatever flag we'd use) would work for those cases.
05:38
<rbuckton>
Oh, I guess I mentioned modifiers during that discussion as well.
15:40
<Richard Gibson>
the decision to make \b and \w not unicode-aware predates me, unfortunately, so I cannot tell you why this is. it does seem... bad.

https://github.com/tc39/proposal-regexp-unicode-property-escapes/issues/22#issuecomment-279930140

There was a pre-ES6 proposal to change the meaning of \w, \d, and \b in Unicode mode. It was ultimately rejected out of fear it would hurt adoption of the u flag.

(https://github.com/tc39/proposal-regexp-unicode-property-escapes/issues/22 is the [failed] attempt to make those escapes Unicode-aware under the v flag)

15:57
<shu>
who can add new members to the tc39 organization on GH?
15:57
<shu>
i'd like to add a V8 bot account for the purposes of test262 2-way sync. i can add the account to the right teams but first it has to be part of the tc39 organization, apparently
16:07
<ljharb>
done.