WHATWG on 2023-08-05

01:48	<zero-one>	i'm looking at the wpt test data for URLs, and i'm a little confused by https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L4725
01:49	<zero-one>	\uD800 \uD801 isn't a valid surrogate pair, so what exactly is meant by this particular input?
01:49	<zero-one>	my json parser chokes on it, understandably
06:41	<Domenic>	What is meant is to test what the URL parser does on invalid surrogate pairs
06:41	<Domenic>	The algorithm handles them just fine (since it operates on strings, which are sequences of 16-bit code units, including invalid surrogates)
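
The point about strings being sequences of 16-bit code units can be illustrated with a minimal sketch. The helper names below (`is_high_surrogate`, `is_low_surrogate`, `is_valid_pair`) are hypothetical and not taken from the URL Standard or from wpt. They only show why `\uD800 \uD801` is a well-defined sequence of two code units even though it is not a valid surrogate pair: both units are high surrogates.

```python
def is_high_surrogate(unit: int) -> bool:
    # High (lead) surrogates occupy U+D800..U+DBFF.
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    # Low (trail) surrogates occupy U+DC00..U+DFFF.
    return 0xDC00 <= unit <= 0xDFFF


def is_valid_pair(first: int, second: int) -> bool:
    # A valid pair is a high surrogate followed by a low surrogate.
    return is_high_surrogate(first) and is_low_surrogate(second)


# The two code units from the test input: both are high surrogates.
units = [0xD800, 0xD801]
print(is_valid_pair(units[0], units[1]))  # False

# For comparison, U+1F600 is encoded in UTF-16 as D83D DE00.
print(is_valid_pair(0xD83D, 0xDE00))  # True
```
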
07:23	<zero-one>	I'm sure the URL parser handles it fine, but I think this is invalid JSON
07:25	<zero-one>	will need to dig up a JSON parser that I can tell to ignore invalid surrogate pairs
08:34	<Domenic>	JSON.parse() works fine on it
14:34	<Richard Gibson>	`"\uD800\uD801"` is valid JSON but not guaranteed to be interoperable (especially if the decoder is in a language that internally uses UTF-8 for strings), cf. https://www.rfc-editor.org/rfc/rfc8259#section-8.2, which says: "However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, `"\uDEAD"` (a single unpaired UTF-16 surrogate)… The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions."
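
As a rough illustration of the interoperability caveat, here is a sketch using Python's standard `json` module. This is one decoder's behavior only and says nothing about the parser zero-one was using. The variable names are chosen for illustration. CPython's decoder accepts the text and keeps both unpaired surrogates in the resulting string, but that string cannot be encoded as strict UTF-8.

```python
import json

# The JSON text from the discussion: two escaped high surrogates.
text = '"\\uD800\\uD801"'

# CPython's json module accepts this and yields a 2-code-point string
# that contains the unpaired surrogates as-is.
value = json.loads(text)
print(len(value))  # 2

# Strict UTF-8 has no encoding for surrogates, so this step fails.
try:
    value.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# The "surrogatepass" error handler opts in to a non-standard byte form.
raw = value.encode("utf-8", "surrogatepass")
print(raw.hex())  # eda080eda081
```

A decoder whose string type must be well-formed UTF-8 has no equivalent of the first step, which is consistent with a parser rejecting the input outright.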