01:48
<zero-one>
i'm looking at the wpt test data for URLs, and i'm a little confused by https://github.com/web-platform-tests/wpt/blob/master/url/resources/urltestdata.json#L4725
01:49
<zero-one>
\uD800 \uD801 isn't a valid surrogate pair, so what exactly is meant by this particular input?
01:49
<zero-one>
my json parser chokes on it, understandably
06:41
<Domenic>
What is meant is to test what the URL parser does on invalid surrogate pairs
06:41
<Domenic>
The algorithm handles them just fine (since it operates on strings, which are sequences of 16-bit code units and can therefore contain unpaired surrogates)
07:23
<zero-one>
I'm sure the URL parser handles it fine, but I think this is invalid JSON
07:25
<zero-one>
will need to dig up a JSON parser that I can tell to ignore invalid surrogate pairs
08:34
<Domenic>
JSON.parse() works fine on it
14:34
<Richard Gibson>

"\uD800\uD801" is valid JSON but not guaranteed to be interoperable (especially if the decoder is in a language that internally uses UTF-8 for strings), cf. https://www.rfc-editor.org/rfc/rfc8259#section-8.2

> However, the ABNF in this specification allows member names and string values to contain bit sequences that cannot encode Unicode characters; for example, "\uDEAD" (a single unpaired UTF-16 surrogate)… The behavior of software that receives JSON texts containing such values is unpredictable; for example, implementations might return different values for the length of a string value or even suffer fatal runtime exceptions.
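
A minimal sketch of the interoperability point, assuming CPython's standard `json` module. Python is not mentioned in the conversation and is used here only for illustration. This decoder happens to accept the input; other decoders may reject it, as the RFC warns. The problem here only shows up when the decoded string has to be turned into UTF-8.

```python
import json

# The JSON text under discussion: two escaped high surrogates, no low surrogate.
json_text = r'"\uD800\uD801"'

# CPython's json module accepts this and keeps both surrogates as
# separate code points in the resulting str.
value = json.loads(json_text)
print(len(value), [hex(ord(c)) for c in value])

# Strict UTF-8 cannot represent surrogate code points, so encoding fails.
try:
    value.encode("utf-8")
    strict_utf8_ok = True
except UnicodeEncodeError:
    strict_utf8_ok = False
print("strict UTF-8 encoding succeeded:", strict_utf8_ok)

# Opting in with errors="surrogatepass" produces bytes, but they are
# not well-formed UTF-8 and other tools may refuse to read them.
lenient_bytes = value.encode("utf-8", errors="surrogatepass")
print(lenient_bytes.hex())
```

A decoder whose string type must hold well-formed UTF-8 has no such escape hatch, so it can only reject the input or substitute replacement characters. That would be consistent with a parser choking on this line of urltestdata.json.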