21:21 | <shu> | so i realized while implementing is/toWellFormed that, while web engines today with 1-byte and 2-byte strings can make isWellFormed constant time trivially for 1-byte strings, optimizing for 2-byte strings is actually fairly involved |
21:21 | <shu> | does anyone have any intuitions on the performance expectations and usage patterns? |
21:22 | <shu> | for 2-byte strings, that is |
21:30 | <guybedford> | @shu I'm not sure there are any major expectations for an initial implementation, but there certainly were hopes that this is something that could have optimization potential in time so that it could be used as a guard / check without too much perf overhead. How involved is it looking? |
21:33 | <Andreu Botella> | couldn't it be optimized opportunistically? maybe caching the result, and setting it at first if a string was created from UTF-8, for example |
21:34 | <littledan> | I'd sort of expect that its usage would ramp up over time (if it becomes widely used at all), so the optimizations that shu and Andreu are thinking of could wait |
21:35 | <guybedford> | The absolute bare minimum might be a boolean flag, acting as a cache of the result, that is invalidated on mutations |
21:35 | <guybedford> | so that at least there's some internal concept of the check that in theory could be an optimization target in future |
21:35 | <littledan> | It's hard to predict whether caching/precomputing vs optimizing the actual calculation would be more important |
21:38 | <littledan> | (I imagine Shu is thinking about the latter if it's seen as involved? One could imagine fancy SIMD stuff to search for the forbidden surrogates...) |
21:44 | <shu> | i'm thinking of the former, the latter i agree can definitely wait |
21:44 | <shu> | i'm mainly thinking about removing the cliff between 1-byte and 2-byte strings for isWellFormed |
21:44 | <shu> | caching a bit requires space to cache the bit, and that is what's involved |
21:46 | <guybedford> | no bits to spare...!? or lots of wiring? |
21:46 | <shu> | we're in a perpetual state of no bits to spare :) |
21:46 | <shu> | but in this case, mainly lots of wiring, yeah |
21:47 | <shu> | web engines today have a 1-bit encoding state: 1-byte or 2-byte |
21:47 | <shu> | you now need 2-bits with 4 states: 1-byte, 2-byte-of-unknown-well-formedness, 2-byte-well-formed, 2-byte-ill-formed |
21:48 | <shu> | you have something of a combinatorial explosion by propagating that through to every string type (rope/cons, slice/view, etc) |
21:48 | <shu> | you can compute it in sub-linear time for all the non-direct (i.e. don't own their own char buffer) string types, except for slices, which i don't know how to do |
21:49 | <shu> | i think you always have to re-iterate sliced strings in some cases |
21:49 | <guybedford> | is it not possible to add something as a placeholder that is basically just "well-formed | unknown" where the default is "unknown", and then it can be ignored by the other types for now? |
21:50 | <guybedford> | for slices, I think the thinking was that careful checking of the slice ends could be done for surrogate splits, if you know the string being sliced is valid |
21:50 | <shu> | that is pretty much the same amount of wiring, no? |
21:50 | <shu> | right, if the underlying of a slice is well-formed, you just need to check the beginning and the end |
21:50 | <shu> | if the underlying is not well-formed you have to re-iterate afaict |
21:51 | <guybedford> | well everything could ignore the well-formed bit and just recompute by default without that being wiring? |
21:51 | <guybedford> | or does even just adding the bit add overhead there? |
21:51 | <shu> | well, if everything just ignores it why add a bit? reserve it now you mean? |
21:52 | <shu> | for V8 there's also the grossness of external strings, where the embedder can choose to take ownership of the character buffer |
21:53 | <shu> | 1-byte or 2-byte encodings are exposed for external string data, and well-formedness would need to be as well i guess? |
21:54 | <guybedford> | const s = "well-formed 😊"; s.isWellFormed() could in theory still benefit as some kind of baseline if the initial static string could pass the bit |
21:54 | <shu> | how is the initial static string well-formedness computed? |
21:54 | <guybedford> | yeah this affects webidl as well |
21:54 | <guybedford> | although again, you're way ahead - the initial expectation was just to keep it simple very much so |
21:55 | <shu> | yeah this can probably all wait pending real world usage data |
21:55 | <guybedford> | when the original encoding pass is taking place? |
21:55 | <shu> | ah, yeah, there are places where we would iterate strings anyway and we might be able to piggyback the well-formedness checks |
21:56 | <shu> | determining 1-byte is one of them, computing hash codes is another |
21:56 | <guybedford> | yeah, in theory external strings might want to set this stuff themselves if they can be trusted to provide that information |
21:57 | <shu> | i think this is all to say the "implementations can make this sub-linear and fast today without much work" only holds for 1-byte strings, which means we'll ship with a performance cliff |
21:57 | <shu> | which is fine (but isn't ideal) and wasn't something i realized |
21:57 | <Andreu Botella> | sub-linear? aren't 1-byte strings always well-formed? |
21:57 | <shu> | yes, that's what i mean, it's constant time for 1-byte strings |
21:58 | <shu> | but there were vague discussions of its being able to be sublinear for 2-byte as well |
21:58 | <shu> | and i kinda thought oh okay that sounds plausible without thinking about it much |
22:00 | <guybedford> | I guess the only concern is if this gets put as something that is considered a "cliff" and no one tackles it...? |
22:01 | <shu> | yeah, and that's a pretty generic concern |
22:01 | <guybedford> | having some idea of what the cliff is and how to approach it in future would certainly help... so it's interesting to hear what you think here |
22:02 | <shu> | (i'm more remarking on the interestingness of the problem space here, do not take what i've said as implementation concerns for shipping) |
22:03 | <shu> | well, the cliff for the naive implementation (constant time for 1-byte, flattening + linear time always for 2-byte) is fairly stark, so for large strings apps can definitely notice |
22:04 | <shu> | as for how to approach, the ideas we've talked about here all sound reasonable to try out |
22:04 | <shu> | that is, caching bit and piggybacking on other operations that already need to iterate the code units anyway |
22:05 | <guybedford> | glad to hear constant time doesn't sound untenable, which wasn't clear when these were purely hypothetical discussions |
22:06 | <shu> | amortized constant time? |
22:06 | <guybedford> | definitely makes sense to think about the integration points of this |
22:06 | <shu> | well i guess even for 1-byte it's amortized constant time |
22:06 | <guybedford> | Rust <> JS string sharing are cases where these expectations are important |
22:08 | <shu> | remind me again, rust strings are... wtf8? |
22:09 | <guybedford> | oh no, this is the point - they are strictly valid UTF8 |
22:09 | <bakkot> | the more relevant thing is wasm strings, which will be list-of-usv, I believe |
22:09 | <bakkot> | list-of-usv implies no unpaired surrogates |
22:09 | <Andreu Botella> | you might be thinking of OsString (for filesystem paths, etc.) which in Windows is WTF-8 |
22:10 | <shu> | yes let's see how far stringref gets |
22:11 | <guybedford> | also note Wasm components will also want this kind of guarantee for strings |
22:11 | <bakkot> | wasm components is the thing I meant yeah |
22:13 | <guybedford> | but in short, we don't need to worry too much about it initially, but it would be nice if there's a clear implementation path to speak to here |
22:13 | <shu> | i believe there is for v8 |
22:14 | <shu> | given that it's non-trivial if you care about the performance in the limit, should confirm with others |
22:14 | <guybedford> | I mean there's no reason to be doing non-trivial work before christmas... |
22:15 | <guybedford> | but glad to hear that |
22:15 | <littledan> | I think Wasm strings integration into various toolchains are likely to be a lot more important for performance than this function |
22:16 | <shu> | true enough, i'm not sure how important the JS integration performance is for any performance-critical wasm workloads right now |
22:16 | <shu> | in the future i expect this to be important |
22:28 | <bakkot> | seems plausible that you might end up wanting to keep track of the "are there surrogates" bit for wasm integration anyway |
23:51 | <littledan> | I think deeper JS/Wasm integration and Wasm component model are sort of a similar order of magnitude away from being performance-critical issues (that is to say, probably not today) |
23:52 | <littledan> | but that doesn't mean we shouldn't design towards it |