21:21
<shu>
so i realized while implementing is/toWellFormed that, while web engines today with 1-byte and 2-byte strings can make isWellFormed constant time trivially for 1-byte strings, optimizing for 2-byte strings is actually fairly involved
21:21
<shu>
does anyone have any intuitions on the performance expectations and usage patterns?
21:22
<shu>
for 2-byte strings, that is
21:30
<guybedford>
@shu I'm not sure there are any major expectations for an initial implementation, but there certainly were hopes that this is something that could have optimization potential in time so that it could be used as a guard / check without too much perf overhead. How involved is it looking?
21:33
<Andreu Botella>
couldn't it be optimized opportunistically? ­maybe caching the result, and setting it at first if a string was created from UTF-8, for example
21:34
<littledan>
I'd sort of expect that its usage would ramp up over time (if it becomes widely used at all), so the optimizations that shu and Andreu are thinking of could wait
21:35
<guybedford>
The absolute bare minimum might be a boolean flag, acting as a cache of the result, that is invalidated on mutations
21:35
<guybedford>
so that at least there's some internal concept of the check that in theory could be an optimization target in future
21:35
<littledan>
It's hard to predict whether caching/precomputing vs optimizing the actual calculation would be more important
21:38
<littledan>
(I imagine Shu is thinking about the latter if it's seen as involved? One could imagine fancy SIMD stuff to search for the forbidden surrogates...)
21:44
<shu>
i'm thinking of the former, the latter i agree can definitely wait
21:44
<shu>
i'm mainly thinking about removing the cliff between 1-byte and 2-byte strings for isWellFormed
21:44
<shu>
caching a bit requires space to cache the bit, and that is what's involved
21:46
<guybedford>
no bits to spare...!? or lots of wiring?
21:46
<shu>
we're in a perpetual state of no bits to spare :)
21:46
<shu>
but in this case, mainly lots of wiring, yeah
21:47
<shu>
web engines today have a 1-bit encoding state: 1-byte or 2-byte
21:47
<shu>
you now need 2-bits with 4 states: 1-byte, 2-byte-of-unknown-well-formedness, 2-byte-well-formed, 2-byte-ill-formed
21:48
<shu>
you have something of a combinatorial explosion by propagating that through to every string type (rope/cons, slice/view, etc)
21:48
<shu>
you can compute it in sub-linear time for all the non-direct (i.e. don't own their own char buffer) string types, except for slices, which i don't know how to do
21:49
<shu>
i think you always have to re-iterate sliced strings in some cases
21:49
<guybedford>
is it not possible to add something as a placeholder that is basically just "well-formed | unknown" where the default is "unknown", and then it can be ignored by the other types for now?
21:50
<guybedford>
for slices, I think the thinking was that careful checking of the slice ends could be done for surrogate splits, if you know the string being sliced is valid
21:50
<shu>
that is pretty much the same amount of wiring, no?
21:50
<shu>
right, if the underlying of a slice is well-formed, you just need to check the beginning and the end
21:50
<shu>
if the underlying is not well-formed you have to re-iterate afaict
21:51
<guybedford>
well everything could ignore the well-formed bit and just recompute by default without that being wiring?
21:51
<guybedford>
or does even just adding the bit add overhead there?
21:51
<shu>
well, if everything just ignores it why add a bit? reserve it now you mean?
21:52
<shu>
for V8 there's also the grossness of external strings, where the embedder can choose to take ownership of the character buffer
21:53
<shu>
1-byte or 2-byte encodings are exposed for external string data, and well-formedness would need to be as well i guess?
21:54
<guybedford>
const s = "well-formed 😊"; s.isWellFormed() could in theory still benefit as some kind of baseline if the initial static string could pass the bit
21:54
<shu>
how is the initial static string well-formedness computed?
21:54
<guybedford>
yeah this affects webidl as well
21:54
<guybedford>
although again, you're way ahead - the initial expectation was just to keep it simple very much so
21:55
<shu>
yeah this can probably all wait pending real world usage data
21:55
<guybedford>
when the original encoding pass is taking place?
21:55
<shu>
ah, yeah, there are places where we would iterate strings anyway and we might be able to piggyback the well-formedness checks
21:56
<shu>
determining 1-byte is one of them, computing hash codes is another
21:56
<guybedford>
yeah, in theory external strings might want to set this stuff themselves if they can be trusted to provide that information
21:57
<shu>
i think this is all to say the "implementations can make this sub-linear and fast today without much work" only holds for 1-byte strings, which means we'll ship with a performance cliff
21:57
<shu>
which is fine (but isn't ideal) and wasn't something i realized
21:57
<Andreu Botella>
sub-linear? aren't 1-byte strings always well-formed?
21:57
<shu>
yes, that's what i mean, it's constant time for 1-byte strings
21:58
<shu>
but there were vague discussions of its being able to be sublinear for 2-byte as well
21:58
<shu>
and i kinda thought oh okay that sounds plausible without thinking about it much
22:00
<guybedford>
I guess the only concern is if this gets put as something that is considered a "cliff" and no one tackles it...?
22:01
<shu>
yeah, and that's a pretty generic concern
22:01
<guybedford>
having some idea of what the cliff is and how to approach it in future would certainly help... so it's interesting to hear what you think here
22:02
<shu>
(i'm more remarking on the interestingness of the problem space here, do not take what i've said as implementation concerns for shipping)
22:03
<shu>
well, the cliff for the naive implementation (constant time for 1-byte, flattening + linear time always for 2-byte) is fairly stark, so for large strings apps can definitely notice
22:04
<shu>
as for how to approach, the ideas we've talked about here all sound reasonable to try out
22:04
<shu>
that is, caching bit and piggybacking on other operations that already need to iterate the code units anyway
22:05
<guybedford>
glad to hear constant time doesn't sound untenable, which wasn't clear when these were purely hypothetical discussions
22:06
<shu>
amortized constant time?
22:06
<guybedford>
definitely makes sense to think about the integration points of this
22:06
<shu>
well i guess even for 1-byte it's amortized constant time
22:06
<guybedford>
Rust <> JS string sharing are cases where these expectations are important
22:08
<shu>
remind me again, rust strings are... wtf8?
22:09
<guybedford>
oh no, this is the point - they are strictly valid UTF8
22:09
<bakkot>
the more relevant thing is wasm strings, which will be list-of-usv, I believe
22:09
<bakkot>
list-of-usv implies no unpaired surrogates
22:09
<Andreu Botella>
you might be thinking of OsString (for filesystem paths, etc.) which in Windows is WTF-8
22:10
<shu>
yes let's see how far stringref gets
22:11
<guybedford>
also note Wasm components will also want this kind of guarantee for strings
22:11
<bakkot>
wasm components is the thing I meant yeah
22:13
<guybedford>
but in short, we don't need to worry too much about it initially, but it would be nice if there's a clear implementation path to speak to here
22:13
<shu>
i believe there is for v8
22:14
<shu>
given that it's non-trivial if you care about the performance in the limit, should confirm with others
22:14
<guybedford>
I mean there's no reason to be doing non-trivial work before christmas...
22:15
<guybedford>
but glad to hear that
22:15
<littledan>
I think Wasm strings integration into various toolchains are likely to be a lot more important for performance than this function
22:16
<shu>
true enough, i'm not sure how important the JS integration performance is for any performance-critical wasm workloads right now
22:16
<shu>
in the future i expect this to be important
22:28
<bakkot>
seems plausible that you might end up wanting to keep track of the "are there surrogates" bit for wasm integration anyway
23:51
<littledan>
I think deeper JS/Wasm integration and Wasm component model are sort of a similar order of magnitude away from being performance-critical issues (that is to say, probably not today)
23:52
<littledan>
but that doesn't mean we shouldn't design towards it