| 18:46 | <krit> | annevk: won’t make it before noon tomorrow (personal reasons) |
| 19:50 | <Hixie> | got it down to 8.5s user+sys to do all the tests and parse the html spec and reserialise it |
| 22:58 | <gsnedders> | which of the encodings defined in Encoding are not ASCII-supersets? |
| 22:59 | <caitp> | ebcdic |
| 23:00 | <gsnedders> | caitp: is not in Encoding |
| 23:01 | <caitp> | i know |
| 23:01 | <gsnedders> | then it by definition is not an encoding defined in Encoding which is not an ASCII-superset |
| 23:01 | <caitp> | anyways, it would be anything which doesn't have the "if it's less than 0x80, return it" |
| 23:01 | <caitp> | clause |
| 23:02 | <caitp> | with the exception of the utf16 stuff |
| 23:02 | <gsnedders> | and possibly some of the SBCSes, as at least ibm866 isn't |
| 23:02 | <zewt> | i don't recall there being any at all, ascii-compatibility is pretty fundamental |
| 23:02 | <caitp> | utf16be isn't really ascii-compatible |
| 23:02 | <caitp> | on a little endian system |
| 23:03 | <gsnedders> | no variant of UTF-16 is an ASCII-superset |
| 23:03 | <zewt> | it's not a multibyte encoding at all, double-byte encodings are a different world entirely |
| 23:03 | <caitp> | well, they are sort of |
| 23:04 | <caitp> | if the low byte is the first byte read, and you're skipping a byte for each character, and the code points are all below 0x80 |
| 23:04 | <zewt> | oh yeah this http://krijnhoetmer.nl/irc-logs/whatwg/20111215#l-1034 |
| 23:05 | <zewt> | hope was to get ibm866 dropped, no idea if anyone actually tried |
| 23:06 | <zewt> | caitp: "skipping a byte for each character" if you have to skip every other byte then ... that's not a superset of ASCII. heh |
| 23:06 | <caitp> | it is for the first character you read ;) |
| 23:07 | <zewt> | as ascii supports streams which are longer than one byte long, that's also not a superset of ASCII :0 |
| 23:07 | <zewt> | ) |
| 23:07 | <caitp> | ascii is a text encoding and has no concepts of streams |
| 23:08 | <caitp> | a single utf16 character can look like a null-terminated ascii string |
| 23:08 | <zewt> | not sure what this has to do with the fact that UTF-16 is in no possible conceivable contrived way a superset of ASCII, heh |
| 23:09 | <caitp> | it is, because unicode is a superset of ascii, codepoints 0x00-0x7F, followed by latin1 extensions to ascii, followed by the rest of the basic multilingual plane |
| 23:09 | <zewt> | encodings that are streams of 8-bit units (ascii, utf-8, sjis, most of them) are typically treated as separate concepts to ones that are streams of 16-bit units (utf-16, ucs-2) or 32-bit (ucs-4) |
| 23:09 | <zewt> | ... utf-16 is not a superset of ASCII. sorry, this is too silly a conversation for me to bother with |
| 23:10 | <caitp> | unicode is a superset of ASCII, and if you look at patterns of bytes, it's possible that you can't tell the difference between certain single-character UTF16 strings, and certain null-terminated ASCII strings |
| 23:11 | <zewt> | no. an encoding which is a superset of ASCII is one where the same string of codepoints ("hello"), encoded with both encodings, results in the same block of data. |
| 23:13 | <caitp> | nonsense, we're in agreement that utf16 by definition contains codepoints represented by a minimum of 16 bits, but that does not mean that codepoints between 0x0000 and 0x0080 aren't supersets of ascii, and can't look identical to certain ascii strings |
| 23:13 | <caitp> | obviously that depends on arch and doesn't include multi-character strings, but that's irrelevant |
| 23:16 | <zewt> | you seem to have a deep misunderstanding of what "superset of ascii" means; it does not mean "every sequence of bytes that is valid ASCII is also valid UTF-16", it means "every sequence of bytes that is valid ASCII *has the same interpretation* in UTF-16", which is obviously false |
| 23:16 | <zewt> | anyhow, going to do something else now :) |
| 23:17 | <caitp> | that's one definition of superset, but when you get down to patterns of bits, it's not the case |
| 23:17 | <caitp> | but regardless I agree it's not a super important discussion to have |
| 23:17 | <caitp> | nobody cares about utf16 =) |
| 23:18 | <gsnedders> | Plenty of people care about UTF-16 and it's used plenty |
| 23:19 | <caitp> | it's not really used in any serious capacity for interchange of data |
| 23:20 | <gsnedders> | Plenty of CJK sites use it |
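
The definition zewt gives at 23:11 and 23:16 (the same string of code points yields the same bytes, and every ASCII byte keeps its interpretation) can be checked mechanically. The following is a minimal sketch, not anything from the log: `is_ascii_superset` is a hypothetical helper, it uses Python's built-in codecs rather than the decoders defined in the Encoding Standard (the two do not match exactly), and `cp500` is only a stand-in for the EBCDIC family caitp mentions.

```python
def is_ascii_superset(codec: str) -> bool:
    """Hypothetical helper: True if each byte 0x00-0x7F, decoded on its own
    with the given Python codec, yields the code point of the same value."""
    for b in range(0x80):
        try:
            if bytes([b]).decode(codec) != chr(b):
                return False
        except UnicodeDecodeError:
            # A lone byte is not a complete code unit (e.g. UTF-16).
            return False
    return True


if __name__ == "__main__":
    # Codec names are Python's, chosen for illustration only.
    for name in ("ascii", "utf-8", "latin-1", "cp500", "utf-16-le", "utf-16-be"):
        print(name, is_ascii_superset(name))

    # zewt's "hello" test: same code points, compare the resulting bytes.
    print("hello".encode("ascii") == "hello".encode("utf-8"))
    print("hello".encode("ascii") == "hello".encode("utf-16-le"))
```

With these codecs, the UTF-16 variants and `cp500` fail the check, while `utf-8` and `latin-1` pass. caitp's observation also shows up here: `"h".encode("utf-16-le")` is the two bytes `h` and NUL, which happens to look like a NUL-terminated one-character ASCII string, but that coincidence does not extend to `"hello"`.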