00:00
<Philip`>
Oh, I guess I just need to fix my concept of 'current token' so it's not simply the most recent token on the stack
00:00
<hsivonen>
Philip`: stack?
00:01
<Philip`>
Well, append-only stack
00:01
<Philip`>
so, er, I guess it's more like a list
00:03
<hsivonen>
Philip`: do you mean your test harness builds a list of tokens?
00:03
<hsivonen>
I was just thinking that there's no stack in the tokenizer
00:05
<Philip`>
The tokeniser itself builds a list of tokens (and then prints them all out at the end)
00:05
<Philip`>
(though I can change it to not do that, because it only ever needs a single current token and a bit of cheating to merge character tokens)
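The character-token merging Philip` mentions can be sketched as a small pass over the token stream; the (kind, data) tuple representation here is hypothetical, not his actual OCaml/C++ code:

```python
# Sketch of coalescing adjacent character tokens into one, so a
# tokeniser only ever needs a single "current token" plus a buffer.
# Tokens are hypothetical (kind, data) pairs for this illustration.

def coalesce(tokens):
    """Merge runs of adjacent ("chars", text) tokens into one token."""
    out = []
    for kind, data in tokens:
        if kind == "chars" and out and out[-1][0] == "chars":
            out[-1] = ("chars", out[-1][1] + data)
        else:
            out.append((kind, data))
    return out
```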
00:19
<Philip`>
Ooh, it sort of almost works, some of the time
00:22
<Hixie>
interesting, i never considered parse errors as a token type
00:22
<Hixie>
i just treat them as an out-of-band callback called during parse (my parser is synchronous, it returns a complete document once the parsing is done)
00:25
<hsivonen>
Hixie: I treat both errors and tokens as callbacks
00:25
<Hixie>
right
00:25
<hsivonen>
they are on different interfaces but the handler that generates JSON implements both
00:27
<Philip`>
Now I pass all of test1.dat and test2.dat except for about half of them which are just bits I haven't quite implemented yet
00:29
Philip`
needs to write a Perl one after this
00:35
<Philip`>
(Actually, I probably don't, since there'd be no point at all)
00:35
<webben>
why not?
00:37
<Philip`>
Because a tokeniser by itself isn't very useful :-)
01:30
<Philip`>
http://canvex.lazyilluminati.com/misc/states6.png - now with fewer bugs than before, since implementation seems to pass most of the tests now
01:31
<Philip`>
*the implementation
01:32
<nickshanks>
yay squiggly lines
01:33
<Lachy>
Philip`: what's the difference between red and black lines?
01:33
<Philip`>
(Oh, I segfault on <x y="&">, which can't be good)
01:34
<nickshanks>
an especially squiggly red one going from CommentEndDash to Data
01:34
<Philip`>
Lachy: Red is transitions that are parse errors, black is transitions that probably aren't
01:34
<Lachy>
ok
01:34
<Philip`>
("probably" because of the parse-error-unless-it's-a-permitted-slash thing, which the graph treats as not-an-error)
01:35
<Hixie>
that's awesome
01:35
<Hixie>
why not have a red line and a black line when you have the permitted slash thing?
01:35
<Hixie>
you do that elsewhere
01:35
<Hixie>
yay, bogus doctype only has red arrows leading to it
01:35
<Hixie>
same with bogus comment, yay
01:36
<Hixie>
you really should use another colour for the EOF transitions
01:36
<Hixie>
in fact maybe we should have an EOF state
01:36
<Hixie>
instead of having EOF go back to the data state
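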
01:37
<Philip`>
I can't easily have both because I only generate one arrow per transition from the original algorithm, and then delete all duplicates, so it only ends up with red+black when there are two separate transitions between the same states
01:37
<Hixie>
ah ok
01:37
<Hixie>
didn't realise it came from actual code
01:38
<Hixie>
that graph is awesome
01:38
<Philip`>
It's not entirely actual code - the algorithm is represented as data in OCaml, and I can generate that graph or a C++ implementation from that data
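The algorithm-as-data approach Philip` describes, with the graph as just one backend, might look something like this in Python; the transition triples are invented for illustration, not his real OCaml representation:

```python
# Emit a Graphviz DOT graph from a table of tokeniser transitions.
# Each transition is (from_state, to_state, is_parse_error); duplicates
# are collapsed first, which is why one arrow can't be both red and
# black when the same pair of states appears as both error and non-error.

def to_dot(transitions):
    lines = ["digraph tokeniser {"]
    for src, dst, is_error in sorted(set(transitions)):
        colour = "red" if is_error else "black"
        lines.append('  "%s" -> "%s" [color=%s];' % (src, dst, colour))
    lines.append("}")
    return "\n".join(lines)
```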
01:38
<Hixie>
it shows that there are really three basic ideas
01:38
<Hixie>
aah
01:38
<Hixie>
cool
01:39
<rubys>
is there any reason why you couldn't generate a, say, Python or Ruby implementation from that data?
01:42
<Philip`>
http://canvex.lazyilluminati.com/misc/states7.png - unless I did something wrong, that has blue lines for every transition that cannot occur if EOF is never consumed
01:42
<Philip`>
(i.e. all the transitions that are (at least partially) caused by EOF)
01:43
<Hixie>
can you try it with a separate state for EOF? or is that more effort than it's worth?
01:43
<Hixie>
it'd be cool to have the arrows go down to another state for EOF, it would look less cluttered i'd think
01:43
<Hixie>
just an idea, don't worry about it if it's more work than a few seconds :-)
01:43
<Philip`>
rubys: I don't think there is any reason why that wouldn't work
01:43
<Hixie>
this is really cool
01:45
<Philip`>
I've still had to manually write a few hundred lines of C++ (which would need to be ported to other languages), mostly for the entity parsing (since that's too boring to do in a more generic way), but then it generates a thousand lines of state-machine code automatically
01:45
<rubys>
I don't know OCaml, but this sounds like a wonderful excuse to learn. Will you be publishing your source at some point?
01:45
<Philip`>
I didn't know it either, so I'm using it as exactly the same excuse ;-)
01:46
<Philip`>
I'll try to upload what I've done soonish
01:46
<KevinMarks>
looks like it could be used to generate code coverage testcases too
01:47
<Philip`>
I'm sure there must be a way to add in a new EOF state in about three lines of code, but I'm also sure they'll take a few minutes to work out...
01:55
<Philip`>
http://canvex.lazyilluminati.com/misc/states8.png
01:55
<Philip`>
(Hmm, it took fourteen lines)
01:56
<Hixie>
sweet
01:57
<Hixie>
that's totally awesome
01:58
<Philip`>
Now I just need to make it able to generate the spec text from the algorithm ;-)
01:58
<Hixie>
hah
02:00
Philip`
wonders if people have experience of how much more time it takes to implement tree construction compared to tokenisation
02:00
<Hixie>
about twice as long to write, about three times as long to debug, iirc
02:00
<Hixie>
but it's not especially hard
02:00
<Hixie>
just tedious
02:07
<rubys>
why is there a blue arrow from data to data?
02:11
<Philip`>
Because I modified the algorithm so any case which is triggered by EOF and causes a transition into the Data state, was changed to transition into the EOFData state
02:11
<Philip`>
but the relevant part inside the Data state bit of the algorithm doesn't transition into the Data state
02:12
<Philip`>
(because I didn't bother writing in the "stay in the same state" bits explicitly)
02:12
<Philip`>
so that could be considered a bug in my old-algorithm-to-new-algorithm transformation code, but it'd require too much effort to fix :-)
02:18
<Philip`>
Hmm, it's far too easy to get exponential growth in these things
02:20
<Philip`>
http://canvex.lazyilluminati.com/misc/states9.png - I'm not sure why it's gone quite that bad
02:21
<Hixie>
holy crap what the hell is that
02:21
<Hixie>
states * pcdata etc?
02:22
<Philip`>
Yes
02:23
<Philip`>
I suppose it's unhappy because lots of states emit start/end tag tokens when they see EOF, and the tokeniser can't tell what the tree constructor is going to do to the content model flag when that happens, so I assume it could end up being set to anything, which causes unpleasant growth
02:23
<Philip`>
("I assume" = "I tell the code to assume")
02:26
<Philip`>
Looks like that is the case - http://canvex.lazyilluminati.com/misc/states10.png is far better without the EOFs
02:27
<Lachy>
Philip`: what are you using to create those flow charts?
02:28
<Philip`>
Graphviz
02:30
<Philip`>
(It does tend to collapse into a mass of unreadable squiggles when you get past a certain size, and I always tend to use it on things that approach that size, but I've not heard of anything else that does the same kind of thing)
02:31
<Philip`>
(Uh, "same kind of thing" = drawing graphs, not collapsing into squiggles)
02:46
<minerale>
What is whatwg ?
02:47
<wildcfo>
u mean this channel?
02:47
<minerale>
The website needs an 'about' link
02:47
<minerale>
I just saw the site in a slashdot sig, went there and was not sure how it related to silverlight
02:50
<minerale>
is it some kind of social front end to w3c's html specifications?
02:51
<Hixie>
minerale: see http://blog.whatwg.org/faq/#whattf
02:52
<Hixie>
minerale: we're basically the renegade group that started html5
02:57
<Philip`>
My entirely unoptimised C++ tokeniser (which no longer segfaults) takes about 0.4 seconds for the HTML5 spec, which doesn't seem too bad
02:59
<Philip`>
(It's certainly a bit useless, because it just computes all the tokens and then memory-leaks them away)
03:12
<Hixie>
Philip`: yeah tokenising is easy
03:13
<Hixie>
Philip`: the tree construction is definitely the more expensive part
07:46
<Hixie>
http://html5.googlecode.com/svn/trunk/data/
07:46
<Hixie>
enjoy
07:47
<Hixie>
(hsivonen, jgraham, Philip`, anyone else writing an HTML5 parser ^)
08:28
<hsivonen>
Hixie: thank you
08:32
<Hixie>
hm
08:32
<Hixie>
so i have data on attributes-per-element and suchlike
08:32
<Hixie>
but i don't know exactly what you want to know
08:33
<hsivonen>
cumulative percentages of x% had 0 attributes, y% had <=1 attributes, z% had <=2 attributes, etc.
08:33
<Hixie>
hm
08:36
<Hixie>
hm
08:36
<Hixie>
if 10 elements had 0 attributes
08:36
<Hixie>
and i know there were 20 elements
08:36
<Hixie>
and 5 elements had 1 attribute
08:36
<hsivonen>
I meant element instances, not element names, btw
08:37
<Hixie>
that means 75% had <= 1
08:37
<Hixie>
right?
08:37
<Hixie>
yeah
08:37
<Hixie>
i know
08:37
<hsivonen>
yes
08:37
<Hixie>
so i just need to add numbers until i get to one where i don't know the number
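The arithmetic Hixie walks through (10 of 20 elements with 0 attributes, 5 with 1, so 75% have <= 1) is just a running sum turned into percentages:

```python
def cumulative_percentages(counts, total):
    """counts[n] = number of element instances with exactly n attributes.
    Returns, for each n, the percentage of instances with <= n attributes."""
    running, result = 0, []
    for count in counts:
        running += count
        result.append(100.0 * running / total)
    return result
```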
09:12
<Hixie>
hsivonen: ok, see http://html5.googlecode.com/svn/trunk/data/misc.txt
09:16
<hsivonen>
Hixie: thank you
09:17
<Hixie>
my pleasure
09:17
<hsivonen>
elements with <= 0 attributes: 33.5% is lower than I would have guessed
09:18
<Hixie>
most documents consist primarily of <td>s with bgcolors, <font>s, and such like
09:19
<hsivonen>
I had guessed that 3 attributes is the common case. not such a bad guess
09:20
<hsivonen>
I readjust my guess to 5
09:20
<Hixie>
amusingly, the more documents i scan, the greater the portion that is XHTML
09:21
<Hixie>
in a sample of several dozen billion documents, it was about 0.2%, vs 0.02% for a sample of only a few billion (smaller sample being biased towards western pages with higher page rank)
09:22
<Hixie>
(0.2% is vs 97.5% for text/html)
09:29
<hsivonen>
Hixie: testing whether the stack has "table" in table scope is the same as checking whether there's a "table" on the stack at all, right?
09:30
<zcorpan_>
Hixie: did you find anything with <! ">" > ?
09:30
<Hixie>
zcorpan_: didn't have a chance to look into that yet
09:30
<zcorpan_>
ok
09:30
<Hixie>
zcorpan_: but the fact that only IE does it makes me think it's not a big deal
09:30
<Hixie>
hsivonen: um
09:31
<Hixie>
hsivonen: yeah, i guess so
09:31
<hsivonen>
Hixie: ok. thanks
09:31
<Hixie>
does the spec ever ask that?
09:33
<hsivonen>
Hixie: yes
09:33
<hsivonen>
Hixie: I sent email
09:33
<Hixie>
k
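The equivalence hsivonen asked about holds because the elements that terminate "table scope" (html and table) include table itself, so scanning the stack can never skip past a table element while looking for one. A sketch of the scope check:

```python
def has_in_table_scope(stack, target):
    """Walk the stack of open elements from the top; succeed on target,
    fail on a scoping element (html and table terminate "table scope").
    Because table is itself a scope boundary, asking for "table" in
    table scope is equivalent to "table" being on the stack at all."""
    for node in reversed(stack):
        if node == target:
            return True
        if node in ("html", "table"):
            return False
    return False
```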
09:35
<othermaciej>
Hixie: so obviously XHTML lowers your pagerank!
09:35
<othermaciej>
evil google conspiracy!
09:35
<Hixie>
othermaciej: lol
09:35
<Hixie>
i wouldn't be surprised if that was actually true
09:36
<Hixie>
i don't think google really supports xhtml
09:36
<Hixie>
we probably treat it as text/html and get all confused or something
09:37
<zcorpan_>
like mobiles?
09:37
<zcorpan_>
:)
09:43
<Hixie>
yeah, probably
09:53
<hsivonen>
Hixie: according to markp and Matt Cutts on rubys' blog, the XHTML non-support is changing
09:58
<hsivonen>
hmm. actually, neither of them said anything about Google parsing XHTML right...
10:01
<othermaciej>
I wonder how much of the nominal html on the web is mobile-targeted (and therefore not really parsed as xhtml)
10:29
<zcorpan_>
othermaciej: btw, did you debug why dom2string hit a "Maximum call stack size exceeded" error in webkit?
10:31
<othermaciej>
zcorpan_: haven't had time so far
10:31
<othermaciej>
zcorpan_: can you remind me of the relevant URL?
10:31
<othermaciej>
I can try it now
10:34
<zcorpan_>
othermaciej: http://simon.html5.org/temp/html5lib-tests/
10:38
<othermaciej>
zcorpan_: thanks
10:42
<gsnedders>
does any HTML5 document meet the nesting requirements once parsed?
10:43
<zcorpan_>
gsnedders: what nesting requirement?
10:44
<gsnedders>
zcorpan_: things like <div>test<p>test</p></div>
10:44
<hsivonen>
gsnedders: yes.
10:44
<gsnedders>
like, the content model (I say remembering the name)
10:44
<zcorpan_>
gsnedders: oh. no.
10:44
<hsivonen>
gsnedders: no
10:45
<zcorpan_>
gsnedders: any stream of characters results in a tree. but it might not conform to the content model rules
10:46
<gsnedders>
I'm just thinking about how plausible it'd be to take arbitrary input and output (machine-checkable) conformant HTML5
10:48
<hsivonen>
gsnedders: you'd probably need methods similar to what John Cowan's TagSoup uses
10:48
<hsivonen>
gsnedders: the HTML5 parsing algorithm itself specifically is not about doing that
10:48
<gsnedders>
hsivonen: I know. I was just wondering how much it does do in itself.
10:48
<othermaciej>
even things that parse without parse errors could result in a non-conforming document
10:49
<zcorpan_>
gsnedders: it builds a tree
10:49
<hsivonen>
gsnedders: it makes sure that tables don't have intervening cruft and it moves stuff between head and body to head
10:49
<gsnedders>
My issue was really as to how close to being conforming the output of it was, and whether what it did change made those sections conforming
10:50
<othermaciej>
I wonder if all the machine-checkable conformance criteria are practically machine-fixable
10:50
<gsnedders>
othermaciej: invalid dates won't be.
10:50
<gsnedders>
(short of dropping them)
10:51
<hsivonen>
othermaciej: everything that is machine-checkable is machine-fixable to the point that the machine checker doesn't know the difference (but the result can be totally bogus)
10:51
<zcorpan_>
<title>s in body are still moved to head too, right?
10:51
<othermaciej>
you could use an ultra-lenient best-guess date parser
10:51
<hsivonen>
case in point: filling alt attributes with junk
10:51
<gsnedders>
othermaciej: even that will have limitations.
10:51
<hsivonen>
or copying src to longdesc to please an accessibility checker
10:52
<othermaciej>
whether an alt attribute is junk isn't machine-checkable, really
10:52
<hsivonen>
othermaciej: that's my point
10:52
<othermaciej>
it might be that an image is a picture of the text "asdfjkl; i hate conformance checking"
10:52
<othermaciej>
and so that would be totally valid alt text
10:53
<hsivonen>
so anything that is non-conforming in a machine-checkable way can be replaced with stuff that is semantically junk but that is syntactically ok
10:53
<othermaciej>
I guess it depends on how you want to fix things
10:54
<othermaciej>
attribute values that would be discarded don't really matter as much as violating content models, in a way
10:54
<othermaciej>
because in the latter case, there might not be a conforming document that looks and acts the same (at least, without rewriting in-page scripts)
10:55
<hsivonen>
gsnedders: in many cases, you can "fix" content models by wrapping consecutive inline nodes in a single p node
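hsivonen's content-model "fix" can be sketched as one pass over a node list; the (name, is_inline) pair representation is invented for illustration:

```python
def wrap_inline_runs(children):
    """Wrap each maximal run of consecutive inline nodes in a "p" node,
    leaving block-level nodes alone. Nodes are (name, is_inline) pairs."""
    out, run = [], []
    for node in children:
        if node[1]:          # inline: accumulate into the current run
            run.append(node)
        else:                # block: flush any pending run, keep node
            if run:
                out.append(("p", run))
                run = []
            out.append(node)
    if run:
        out.append(("p", run))
    return out
```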
11:20
<zcorpan_>
perhaps <foo => should be a parse error (since it doesn't do what the spec says in any of ie, safari, opera, firefox)
11:25
<othermaciej>
what does the spec say to do?
11:26
<zcorpan_>
create an attribute with the name =
11:26
<zcorpan_>
| <foo>
11:26
<zcorpan_>
|   ==""
11:27
<zcorpan_>
opera, moz, safari drop the attribute. ie creates an attribute with the empty string as the name
11:27
<othermaciej>
the spec behavior is extremely weird then
11:28
<zcorpan_>
not really weird. but doesn't match any browser and it's not a parse error
11:38
<hsivonen>
zcorpan_: email time :-)
11:38
<zcorpan_>
emailed
11:46
<othermaciej>
zcorpan_: what blows the JS stack on that page is the runner/process mutual recursion
11:47
<othermaciej>
zcorpan_: not sure offhand why they call each other but perhaps it could be a loop instead
11:48
<othermaciej>
zcorpan_: at some point we will fix the stack limit, it should probably be higher than it is
11:48
<zcorpan_>
othermaciej: ah. so it's not the dom2string that is the problem
11:49
<othermaciej>
zcorpan_: well, it might have been a problem before, but those recurse deeply enough by themselves to exceed the limit
11:58
<zcorpan_>
othermaciej: works when i rewrote it to be a loop
11:59
<othermaciej>
zcorpan_: cool
11:59
<othermaciej>
zcorpan_: thanks for the workaround
12:59
<zcorpan_>
committed workaround to http://html5.googlecode.com/svn/trunk/parser-tests/
13:16
<Philip`>
The OCaml preprocessor makes my brain hurt
13:17
<hsivonen>
trying to think whether the tree building spec asks implementors to do useless stuff takes time...
13:20
<Philip`>
Oh, nice, the camlp4 documentation has an example that does precisely what I'm trying to do, which means I don't have to understand anything and can just copy-and-paste it in
13:45
<zcorpan_>
hmm. getElementsByClassName doesn't take a string as argument. it did before, didn't it?
13:55
<Lachy>
zcorpan_: gEBCN() has gone through various iterations including a space separated string, varargs and array of strings.
13:56
<zcorpan_>
yeah. i thought it was either string or array. appears it is array only
14:02
<zcorpan_>
firefox has implemented it as either string or array
14:02
<zcorpan_>
it seems
14:03
<Lachy>
which version of FF supports it?
14:04
<zcorpan_>
3
14:04
<zcorpan_>
or actually, it only supports array when it has 1 item
14:05
<Lachy>
hopefully that can be fixed before FF3 ships
14:05
<zcorpan_>
it uses space-separated string
14:05
<zcorpan_>
that seems to be more practical anyway to me
14:06
<Lachy>
yeah, in some ways it is, but even with an array, it's not hard to do gEBCN(["foo"]);
14:08
<Lachy>
the array helps when you're programmatically creating a collection of class names, but the space separated string would probably be better optimised for the majority of cases
14:09
<zcorpan_>
classList is an array right
14:09
<zcorpan_>
or can be passed to gEBCN
14:09
<Lachy>
it probably is
14:10
<Lachy>
it's a DOMTokenList
14:10
<Lachy>
http://www.whatwg.org/specs/web-apps/current-work/#domtokenlist0
14:11
<zcorpan_>
does that fit the definition of "array" wrt what gEBCN can take as argument?
14:11
<Lachy>
whether or not a DOMTokenList can be passed to gEBCN would depend on the language binding
14:11
<zcorpan_>
for ECMAScript
14:13
<Lachy>
ideally, it should be possible to pass a DOMTokenList in all languages. I suggest you send mail about the issue
14:13
<othermaciej>
it can be passed, just not clear if the result will be useful
14:13
<othermaciej>
unless the toString conversion is defined
14:13
<othermaciej>
to do something good
14:13
<othermaciej>
which probably it should be
14:14
<Lachy>
the toString should probably return a space separated list of tokens as a string.
14:15
<Lachy>
but I don't think toString is relevant, given the current definition of gEBCN accepting an array
14:15
<zcorpan_>
othermaciej: why does toString matter?
14:15
<othermaciej>
I thought it took a string, sorry
14:15
<othermaciej>
defining it to take an array is weird
14:15
<zcorpan_>
why
14:15
<othermaciej>
it should at the very least accept a string also
14:16
<othermaciej>
it is true that you can do the varargs thing with an array and it also lets you build up a pre-made array
14:16
<othermaciej>
but it makes the common case more awkward
14:16
<othermaciej>
and it requires creating a wasteful temporary object for the common case
14:16
<Lachy>
yeah
14:17
<othermaciej>
and you can use .apply() to pass an array of arguments to a varargs function in JS
14:18
<zcorpan_>
perhaps the spec should be changed to only take a space separated string as argument. and defined DOMTokenList.toString to be useful
14:19
<Lachy>
it could probably be defined to accept either a space separated string, varargs, array or a DOMTokenList.
14:20
<Lachy>
I think that would be possible to define, using the IDL described in the latest DOM Language Bindings draft
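A space-separated-string gEBCN, as zcorpan_ suggests, is straightforward to sketch over a DOM-like structure; the (tag, class_attribute) element representation here is invented:

```python
def get_elements_by_class_name(elements, class_names):
    """class_names is a space-separated string; an element matches when
    its class attribute contains every requested token. Elements are
    (tag, class_attribute) pairs for this sketch, not real DOM nodes."""
    wanted = set(class_names.split())
    return [el for el in elements
            if wanted and wanted <= set(el[1].split())]
```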
15:49
<Philip`>
Ooh, looks like each tokeniser state can only ever be entered with one (or zero) type of current token
15:50
<Philip`>
which is nice because it means I can just cast the current-token pointer without any safety checks, since it's guaranteed to be the right type
20:00
<gsnedders>
FWIW: http://geoffers.no-ip.com/svn/php-html-5-direct — (Barely started) direct implementation of HTML 5's algorithms
20:03
Philip`
wonders why the list of whitespace characters differs from the list used in the tokeniser
20:06
<gsnedders>
Philip`: where is it different? It's the same, just in a different order.
20:06
<Philip`>
The tokeniser doesn't do U+000D
20:06
<gsnedders>
"U+000D CARRIAGE RETURN (CR) characters, and U+000A LINE FEED (LF) characters, are treated specially. Any CR characters that are followed by LF characters must be removed, and any CR characters not followed by LF characters must be converted to LF characters. Thus, newlines in HTML DOMs are represented by LF characters, and there are never any CR characters in the input to the tokenisation stage."
20:06
<gsnedders>
(Input Stream)
20:07
<gsnedders>
whereas within an attribute a CR could occur through an entity
20:09
<hsivonen>
gsnedders: the entity case maps to LF as well now
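The input-stream rule gsnedders quotes boils down to two substitutions, applied before any tokeniser state ever sees a character:

```python
def normalise_newlines(stream):
    """CRLF becomes LF, then any remaining lone CR becomes LF, so the
    tokenisation stage never receives a U+000D character (per the
    quoted Input Stream rule)."""
    return stream.replace("\r\n", "\n").replace("\r", "\n")
```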
20:10
<gsnedders>
hsivonen: that was actually changed? ah. I guess the other parts exist to accommodate XHTML5, then?
20:10
<gsnedders>
actually, XML changes CR as well
20:11
<gsnedders>
"The only way to get a #xD character to match this production is to use a character reference in an entity value literal." — so you can get it through an entity in XML
20:12
<Philip`>
Does any of that conversion apply if you do document.write("<b\r>") ?
20:12
<gsnedders>
Philip`: it goes through the input stream, so yes
20:14
<Philip`>
Ah, okay
20:23
<hsivonen>
gsnedders: CR is now in the same table as the Windows-1252 NCRS
20:23
<hsivonen>
NCRs
20:23
<hsivonen>
gsnedders: if you put an NCR for CR in XML, you get a CR in the infoset/DOM
20:24
<gsnedders>
hsivonen: ah. I didn't notice it when I implemented that separately a few days ago (though I did just copy/paste the table and create code automagically). that's what I thought about XML, though.
20:24
<gsnedders>
trying to remember what specs say when so tired probably isn't sensible :)
20:25
hsivonen
notes that the fragment case does bad things to control flow
21:21
Philip`
wishes he could find a nice way to output C code from OCaml without just sticking lots of strings together, and without using 25K-line libraries with far too many dependencies
21:42
Philip`
wonders what would be the easiest way to prove the tokeniser terminates (assuming the character stream is finite)
21:42
<Philip`>
(I don't doubt that it does, but I like having a computer agree with me...)
21:43
<hsivonen>
oh you are actually proving stuff :-)
21:43
<hsivonen>
I just trust the html5lib tests :-)
21:46
<zcorpan_>
what is a conforming test case? don't we need conformance requirements for test cases?
21:46
<Dashiva>
Can you show the input position is steadily increasing?
21:46
<Philip`>
Since I've got the tokeniser in this format, I thought I might as well try proving various forms of correctness, to make sure I don't forget all the logic stuff I learnt at university :-)
21:47
<Philip`>
Dashiva: No, since it doesn't always steadily increase - some states don't always consume a character
21:47
<Dashiva>
But then you could take those states and show how they're always part of a series of states increasing it
21:49
<Philip`>
I think that'd probably work - I don't know if it can be done automatically, but I guess it shouldn't be hard to manually define a (partial) ordering of states and check that (input_position, state) is always increasing
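Dashiva's suggestion amounts to checking that the non-consuming ("reconsume") transitions alone can never form a cycle: if they can't, every loop through the state machine must consume at least one character, so a finite stream guarantees termination. A sketch over a hypothetical transition table:

```python
def reconsume_cycles_exist(transitions):
    """transitions: (src, dst, consumes_char) triples. If the transitions
    that do NOT consume a character contain a cycle, the tokeniser could
    loop forever without advancing the input position; otherwise the
    pair (input_position, state) strictly increases and it terminates."""
    graph = {}
    for src, dst, consumes in transitions:
        if not consumes:
            graph.setdefault(src, []).append(dst)

    visiting, done = set(), set()

    def dfs(state):
        if state in done:
            return False
        if state in visiting:
            return True          # back edge: found a cycle
        visiting.add(state)
        if any(dfs(n) for n in graph.get(state, [])):
            return True
        visiting.remove(state)
        done.add(state)
        return False

    return any(dfs(s) for s in list(graph))
```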
21:51
<hsivonen>
Philip`: do you use a read()/unread() model?
21:51
<hsivonen>
Philip`: can you prove that there are never two consecutive unreads without a read in between?
21:52
<hsivonen>
that would at least prove it isn't going backwards
21:53
<Philip`>
Is "unread" where the spec says "reconsume the character in the something state"?
21:53
<hsivonen>
I've made quite a few optimizations by just looking hard at the tree building algorithm without proving anything...
21:53
<hsivonen>
Philip`: yes
21:53
<hsivonen>
Philip`: I call unread() before such transitions
21:55
<Philip`>
hsivonen: Okay - I've done about the same, with UnconsumeCharacter/ConsumeCharacter
21:56
<hsivonen>
Philip`: btw, what's your character datatype? a UTF-8 code unit? UTF-16 code unit? UTF-32 code unit?
21:57
<Philip`>
(I've tried to do as literal a translation of the spec text as possible, but 'unconsume' maps onto the state->state [where 'state' means the whole tokeniser state, not just the explicit ones in the spec] transition model much better than 'reconsume in some other state')
21:58
<Philip`>
The C++ implementation just uses a wchar_t, which is 2 or 4 bytes, but it ought to be relatively easy to change that to something better if I had any idea of what would work well
22:01
<Philip`>
I'd like to be able to just start with the original correct algorithm, and then have code that optimises it into a less naive structure, and then output that (as C++ or whatever else you want), though currently I've got none of the optimisation bit :-)
22:04
<Philip`>
(...and then if the spec changes, it'd all work nicely and easily since the optimisation things would just apply themselves to a different algorithm and produce a new correct tokeniser)
22:04
<Philip`>
(I expect this is all far more complex than necessary, but it's fun anyway)
22:45
<Hixie>
hsivonen: i haven't checked, but re </table>, what about: <table><td><ol><li></table> ?
22:45
<Hixie>
vs <table><td><p></table>
22:45
<Hixie>
and ignoring the missing <tr>s, oops
22:47
<hsivonen>
Hixie: well, yeah. I guess we want the errors there after all. my point was that <ol> gets one error anyway when it goes on the stack
23:04
<Hixie>
hsivonen: not in that case, you're in a cell there
23:16
<hsivonen>
Hixie: ooh. good point.
23:16
<hsivonen>
Hixie: except then you aren't IN_TABLE
23:18
<hsivonen>
well, I implemented the spec now
23:20
<Hixie>
doesn't in-cell defer to in-table in that case?
23:21
<hsivonen>
no. it closes the cell first
23:24
<Hixie>
ah
23:24
<Hixie>
hm
23:24
<Hixie>
well i'll look at it in detail at some point
23:24
<Hixie>
:-)