01:42
Philip`
finds another html5lib bug
01:43
<Philip`>
and I now have tests for almost every state transition in the tokeniser, only missing the ones that require non-PCDATA (which I don't handle yet)
02:07
<Philip`>
http://canvex.lazyilluminati.com/misc/test3.test
02:07
<Philip`>
hsivonen: I see a couple of types of test failure in your tokeniser
02:08
<Philip`>
(for "<!doctype! ?>" (too few parse errors) and "<z/0 >" (it misses the attribute), and variations of those)
02:14
<Philip`>
((It wasn't intentional for my tests to use "<!doctype!" and "<z/0" so much - that's just what fell out of the sorting function))
09:30
<hsivonen>
Philip`: thank you. I fixed bugs exposed by your test cases. One test case failure is a bug in your test cases, though: "<z/0 0" that should give 3 errors: non-permitted slash, EOF in attribute name and duplicate attribute "0".
09:30
<hsivonen>
Philip`: are you planning on contributing your tests to html5lib?
10:20
<Hixie>
http://lists.w3.org/Archives/Member/w3c-html-cg/2007JulSep/0013.html is interesting
10:24
<hsivonen>
indeed
10:24
<Hixie>
the "kitchen sink" threads are also interesting
10:25
<Hixie>
in that it seems a lot of people on that mailing list don't really understand what's going on
10:25
<Hixie>
oh well
10:29
<hsivonen>
Hixie: do you mean vigorously lobbying for a detail that is already in the spec?
10:31
<hsivonen>
someone really needs to write a primer on diminishing returns, externalities and network effects for the WG, but I'm pretty sure that if someone did, (s)he'd be slammed for not being a real economist
10:44
<Hixie>
hsivonen: i meant like being worried that XBL2 points to HTML5 and that therefore the security thing might not be defined, etc... missing the whole point that i had to write the security thing anyway, it didn't matter which spec i put it in
10:44
<Hixie>
anyway
10:44
<Hixie>
bed time
10:44
<Hixie>
probably will be online very spottily for the next three weeks
10:53
<hsivonen>
nn
10:54
<hsivonen>
Hixie: oh you referred to public-appformats
11:35
<Philip`>
hsivonen: Oh, whoops, I haven't done anything about duplicate attributes
11:35
<Philip`>
(I guess html5lib hasn't either, since it passed that test)
11:39
<Philip`>
Looks a bit irritating how it says to drop the attribute value before you've actually got an attribute value at all...
11:40
Philip`
tries to think of a nice way to handle that
11:41
<hsivonen>
Philip`: I have a boolean flag
11:42
<hsivonen>
Philip`: and I defer the actual addition of an attribute until the value is complete or known not to exist
11:42
<hsivonen>
Philip`: which was also the source of one class of test case failures you found
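
[Editor's sketch] The deferral hsivonen describes can be illustrated roughly as follows. This is a minimal Python sketch under stated assumptions, not his Java code and not html5lib's: the class `TagBuilder` and all of its method names are hypothetical, and "is an attribute pending?" stands in for the boolean flag he mentions. An attribute is only committed once its value is complete, so a duplicate can be dropped together with its value.

```python
# Hypothetical illustration; none of these names come from html5lib or
# from hsivonen's tokenizer.
class TagBuilder:
    def __init__(self, name):
        self.name = name
        self.attributes = {}        # committed attributes; first occurrence wins
        self.errors = []            # parse errors seen while building the tag
        self._pending_name = None   # None means "no attribute in progress"
        self._pending_value = ""

    def start_attribute(self, name):
        # A new name means the previous attribute (if any) is complete.
        self.finish_attribute()
        self._pending_name = name
        self._pending_value = ""

    def append_value(self, text):
        self._pending_value += text

    def finish_attribute(self):
        # Commit the pending attribute, or drop it if the name is a duplicate.
        if self._pending_name is None:
            return
        if self._pending_name in self.attributes:
            self.errors.append("duplicate attribute " + self._pending_name)
        else:
            self.attributes[self._pending_name] = self._pending_value
        self._pending_name = None
        self._pending_value = ""


tag = TagBuilder("z")
tag.start_attribute("0")
tag.append_value("x")
tag.start_attribute("0")    # duplicate name; its value never reaches the tag
tag.append_value("y")
tag.finish_attribute()
print(tag.attributes, tag.errors)
```

Because nothing is added to `attributes` until `finish_attribute` runs, there is never a half-built attribute on the tag that would need to be altered or removed later.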
11:48
<Philip`>
Adding another state variable makes other things more complex (like when verifying you never have to alter an attribute unless there actually is an attribute), so it'd be nice to avoid that if possible
11:53
<Philip`>
(Well, it's fine to add a state variable into the C++/etc implementation, but preferably not into the conceptual model of the algorithm)
12:04
<jgraham>
Philip`: So do you want html5lib commit access (hint, hint)?
12:10
<hsivonen>
would someone like to volunteer to check an email about bikeshedding, belling the cat and economics 101 for suitability of sending to public-html? in particular, checking whether it is offensively patronizing?
12:11
<Philip`>
jgraham: Oops, I forgot to respond to hsivonen's second comment - I expect it would be good to add these tests to html5lib (which is why I named the file test3.test already :-) )
12:12
<Philip`>
at least once I've fixed the bugs, and added manually-written tests for the other bugs I have in my code
12:12
<jgraham>
Philip`: I went ahead and gave you commit access whether you wanted it or not :)
12:12
<Philip`>
jgraham: I just saw that - thanks :-)
12:28
Philip`
wonders if it matters that his test cases don't have good descriptions
12:29
<jgraham>
Philip`: I think I've fixed issue 50. Your testcases would be most welcome now so I can have some confidence that I did the right thing
12:29
<jgraham>
And also because I promised them in the commit log :)
12:30
<Philip`>
Just trying to fix the duplicate-attribute issue, which hopefully won't take long :-)
12:30
<jgraham>
Descriptions are good but probably not essential - the treebuilder tests are all description free, for example
12:30
<jgraham>
But if you can add them, please do :)
12:31
<Philip`>
I have no idea what most of my test cases are doing, so I don't know how to usefully describe them
12:32
<Philip`>
Maybe I could convince the test-generating program to work out why it's choosing those particular ones, but that seems like more effort than would be worthwhile
12:33
<jgraham>
Philip`: I guess if you keep all auto-generated tests to their own file it's fine
12:33
jgraham
notices he changed something and forgot to run the treewalkers tests
12:34
<hsivonen>
hmm. I guess I just send the message at the risk of offending some people
12:35
<jgraham>
hsivonen: Go for it. At least then we'll get email about how offended people are rather than whether or not Anne should include all his optional tags
12:36
<jgraham>
which is probably the most boring thread ever
12:36
<jgraham>
;)
12:38
<hsivonen>
jgraham: sent
12:41
<hsivonen>
enjoy: http://lists.w3.org/Archives/Public/public-html/2007Jul/0507.html
12:56
<hsivonen>
was I too offensive?
12:57
<Philip`>
jgraham: Committed the new tests now
12:58
<jgraham>
Philip`: Cool
12:58
<Philip`>
including one with duplicate attribute values, which html5lib fails
12:58
<Philip`>
(hsivonen's implementation passes all those tests now)
12:59
<jgraham>
hsivonen: No, I don't think so. Possibly a little terse, but if people read the links they should get the idea (I'm just reading the joel on software one which I don't think I've seen before)
13:12
zcorpan_
likes hsivonen's terse style
13:13
<Philip`>
Hmm, the html5lib Ruby tokeniser doesn't seem entirely happy with EOFs
13:13
<Philip`>
(resulting in various things like <"undefined method `+' for :EOF:Symbol">)
13:51
<jgraham>
Philip`: All your tests seem to pass now
14:14
<Philip`>
jgraham: That must mean more tests are needed ;-)
15:28
<Philip`>
More tests says: html5lib doesn't lowercase tag/attribute names
16:04
<jgraham>
Philip`: We lowercase them at the tree construction stage (because Sam reuses the tokenizer in situations where case is important)
16:06
<Philip`>
Ah, okay
16:06
<hsivonen>
jgraham: what are those situations?
16:07
<Philip`>
How would it be best to test that tokenisers do implement what the spec says (with lowercasing names), while accepting that html5lib doesn't do that at that point?
16:08
<hsivonen>
jgraham: out of curiosity, why didn't you parametrize this in the tokenizer?
16:08
<Philip`>
(And does html5lib work correctly when you do <a a=1 A=2>?)
16:09
<hsivonen>
(I think I've been a bit naïve with the way I handle lower casing per spec instead of having a readCaseFolded() method)
16:57
<Philip`>
hsivonen: Your entity overflow code doesn't quite work - with input like &#x100000041; the value overflows from 0x10000000 to 0x00000000 and it's never negative so it never hits the overflow-handler
16:58
Philip`
will upload tests for that at some point
16:59
<hsivonen>
Philip`: ouch. good point
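
[Editor's sketch] The overflow Philip` reports can be reproduced in a few lines. The Python below is an illustration under assumptions: hsivonen's actual code is not shown in this log, so `wrapped_value` only mimics accumulating hex digits in a 32-bit signed integer, and `checked_value` is one possible fix (compare against the largest code point after every digit), not necessarily the fix that was checked in. Both function names are hypothetical.

```python
MAX_CODE_POINT = 0x10FFFF  # largest Unicode code point


def wrapped_value(hex_digits):
    """Accumulate hex digits the way a 32-bit signed int would (with wraparound)."""
    value = 0
    for digit in hex_digits:
        value = (value * 16 + int(digit, 16)) & 0xFFFFFFFF
    # Reinterpret the low 32 bits as a signed number.
    return value - 0x100000000 if value & 0x80000000 else value


def checked_value(hex_digits):
    """Accumulate hex digits, giving up as soon as the value is out of range."""
    value = 0
    for digit in hex_digits:
        value = value * 16 + int(digit, 16)
        if value > MAX_CODE_POINT:
            return None  # caller treats this as a parse error
    return value


# The digits of "&#x100000041;" wrap around to 0x41 without ever going negative,
# so a "value < 0" test would not notice the overflow.
print(hex(wrapped_value("100000041")))
print(checked_value("100000041"))
```

Checking the range after each digit means the accumulator never gets large enough to wrap, whatever the integer width.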
16:59
<gsnedders>
hsivonen: Sam uses it for parsing XML
16:59
<gsnedders>
hsivonen: (the XML having failed to be processed by an XML parser)
17:03
<hsivonen>
Philip`: fix (I think) checked in
17:03
<hsivonen>
Philip`: thanks
17:09
<Philip`>
hsivonen: Seems to work perfectly now
17:11
<hsivonen>
looks like my code is bad enough to generate community interest after all :-)
17:12
<Philip`>
I'm always interested in breaking things ;-)
17:14
<Philip`>
I'd be interested in trying to generate stack-using code like yours, and seeing how that works in comparison with switch-statements or gotos
17:15
<Philip`>
though I'm not sure how much automatic transformation I can do to extract stackiness, and I'm too lazy to do that manually
17:16
<hsivonen>
Philip`: my expectation is that the code won't be stack-based once server HotSpot has done its thing
17:16
<Philip`>
though that reminds me that I need to collect a set of documents for performance testing...
17:16
<hsivonen>
if the expectation is incorrect, running a byte code-level optimizer would make sense
17:19
<hsivonen>
Philip`: either way, it seems to me that it is easier to convert the kind of code I have written to unconditional jumps than it would be for a switch (which, OTOH, would guarantee no worse than conditional jumps)
17:21
<Philip`>
It seems quite possible that HotSpot could give better performance for your code than for switch-based code, if it's doing lots of inlining and tail-call optimisation - it'd be interesting to see how well it works in practice
17:21
<hsivonen>
Philip`: so whether the use of methods vs. one huge switch makes sense depends on what HotSpot really does
17:21
<Philip`>
That makes sense
17:22
<Philip`>
Unfortunately C++ doesn't have the advantage of dynamic compilation, so I guess things will act totally differently there
17:22
<Philip`>
but fortunately it doesn't have dynamic compilation, so you can usually have some idea of what the compiler's actually going to do to your code :-)
17:23
<hsivonen>
Philip`: the problem is that testing which approach really performs better on HotSpot is a non-trivial task. Which is why I went with an unverified educated (hopefully :-) guess
17:23
<hsivonen>
Philip`: yeah, I'd bet on switch in the C++ case
17:24
<Philip`>
That's why I'd like to be able to generate different implementation approaches from the same source data, which is still non-trivial but involves much less typing :-)
17:25
<hsivonen>
besides, for Gecko-like threading (lack thereof), on would want to have a switch-based parser with states broken down even further so that each state reads at most one character
17:25
<hsivonen>
this way, the state variable would effectively store the current continuation
17:26
<hsivonen>
and the tokenizer could be interrupted after any input character
17:26
<hsivonen>
s/on would/one would/
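
[Editor's sketch] The idea of a tokenizer whose state variable acts as the continuation can be sketched like this. It is a deliberately tiny Python illustration, not the HTML5 tokenizer and not anyone's real code: the class `CharAtATimeTokenizer`, its three states and its token tuples are all hypothetical. Each call to `feed` handles exactly one input character and returns, so the caller can stop after any character and resume later. It uses `str.isascii`, which assumes Python 3.7 or later.

```python
class CharAtATimeTokenizer:
    DATA, TAG_OPEN, TAG_NAME = range(3)

    def __init__(self):
        self.state = self.DATA   # everything needed to resume lives here...
        self.buffer = []         # ...and in this partial-token buffer
        self.tokens = []

    def feed(self, ch):
        """Consume one character, update the state, and return."""
        if self.state == self.DATA:
            if ch == "<":
                self._flush_text()
                self.state = self.TAG_OPEN
            else:
                self.buffer.append(ch)
        elif self.state == self.TAG_OPEN:
            if ch.isascii() and ch.isalpha():
                self.buffer.append(ch.lower())
                self.state = self.TAG_NAME
            else:
                # Not a tag after all: keep "<" as text and reprocess ch.
                self.buffer.append("<")
                self.state = self.DATA
                self.feed(ch)
        elif self.state == self.TAG_NAME:
            if ch == ">":
                self.tokens.append(("StartTag", "".join(self.buffer)))
                self.buffer = []
                self.state = self.DATA
            else:
                self.buffer.append(ch.lower())

    def _flush_text(self):
        if self.buffer:
            self.tokens.append(("Characters", "".join(self.buffer)))
            self.buffer = []


tok = CharAtATimeTokenizer()
for ch in "hi<B":        # input stops mid-tag, e.g. the network stalls
    tok.feed(ch)
for ch in ">x":          # more input arrives later; tokenizing just resumes
    tok.feed(ch)
print(tok.tokens)
```

Nothing is kept on the call stack between calls to `feed`, which is what lets a single-threaded host interrupt the tokenizer at any character boundary.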
17:27
<Philip`>
What happens when you need to look ahead by ~6 characters at once?
17:27
<hsivonen>
Philip`: I don't.
17:27
<hsivonen>
Philip`: my max lookahead is one read()/unread()
17:28
<hsivonen>
Philip`: otherwise, I buffer pessimistically and look back
17:28
<hsivonen>
Philip`: when I start consuming a doctype, I start building a bogus comment in parallel just in case
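
[Editor's sketch] "Buffer pessimistically and look back" could look something like the following. This is a hypothetical Python illustration, not hsivonen's code: the function `scan_after_bang` and its return values are invented for the example, and it ignores real `<!--` comments entirely. While the characters after `<!` are being matched against "doctype" one at a time, they are also saved as the start of a bogus comment, so a mismatch needs no multi-character lookahead.

```python
KEYWORD = "doctype"


def scan_after_bang(text):
    """Scan the characters that follow "<!", one at a time.

    Returns (kind, buffered, consumed):
      kind      "doctype", "bogus comment" or "need more input"
      buffered  the comment text saved so far, in case it is needed
      consumed  how many characters were read
    """
    buffered = []
    for index, ch in enumerate(text):
        if ch.lower() == KEYWORD[index]:
            buffered.append(ch)          # save it, just in case
            if index + 1 == len(KEYWORD):
                return ("doctype", "", index + 1)
        else:
            # Mismatch: what was read so far is already the start of the
            # comment, and ch is left for the bogus comment state.
            return ("bogus comment", "".join(buffered), index)
    return ("need more input", "".join(buffered), len(text))


print(scan_after_bang("docty>"))
print(scan_after_bang("DOCTYPE html"))
print(scan_after_bang("doc"))
```

In the first call the decision is made as soon as `>` arrives, without waiting for any further input.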
17:30
<hsivonen>
I should learn how to dump native code disassemblies from HotSpot some day
17:32
<Philip`>
Ah, okay, so if you had "<!docty>"(network latency) then it would emit the token before running out of characters
17:32
<hsivonen>
yes
18:46
<hsivonen>
I wonder what the Wikipedia article on "HTML 5" means when it says "Elements no longer compatible with HTML 4 – a, hr, strong"
18:49
<othermaciej>
[NEEDS CITATION]
18:50
<Philip`>
Looks like they just chose a random selection of points from html4-differences
18:50
<Philip`>
and in that case, particularly the points under "These elements have new meanings in HTML 5 which are incompatible with HTML 4"
18:52
<Philip`>
(Not entirely sure what the point is in duplicating that data badly, when there's an external link to html4-differences)
19:13
zcorpan_
edited the wiki page: "Elements with redefined meaning which are not compatible with HTML 4 – a, hr, strong"
19:42
<jgraham>
hsivonen: re: why case handling wasn't parameterized in the tokenizer: I don't know. I think Sam just picked a solution that did what he wanted. Is there a good reason to prefer a different approach
19:42
<jgraham>
?
19:43
<jgraham>
Philip`: I'll change the html5lib test harness to do the same thing with attribute names as the treebuilder
19:48
<hsivonen>
jgraham: reasons are running tokenizer-level tests and eliminating duplicate attributes
19:50
<hsivonen>
jgraham: going forward if we integrate SVG, a flag you can toggle in mid-tokenization might become useful
19:50
<jgraham>
hsivonen: That's a good point
19:50
<jgraham>
OK, I think I will change it to work with a flag
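
[Editor's sketch] A flag of the kind being discussed might work roughly as below. This is a hypothetical Python sketch, not html5lib's API: the function `collect_attributes` and its `lowercase_names` parameter are made up for illustration. It shows why folding case inside the tokenizer matters for duplicate detection: with the flag on, `a` and `A` collide; with it off (the case-sensitive situation), they do not.

```python
import string

# ASCII-only folding; str.lower() would also change non-ASCII letters.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def collect_attributes(pairs, lowercase_names=True):
    """Build an attribute dict from (name, value) pairs; first occurrence wins."""
    attributes = {}
    errors = []
    for name, value in pairs:
        if lowercase_names:
            name = name.translate(ASCII_LOWER)
        if name in attributes:
            errors.append("duplicate attribute " + name)
        else:
            attributes[name] = value
    return attributes, errors


# <a a=1 A=2> with and without case folding in the tokenizer
print(collect_attributes([("a", "1"), ("A", "2")]))
print(collect_attributes([("a", "1"), ("A", "2")], lowercase_names=False))
```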
19:51
<hsivonen>
perhaps in the future I move case folding to one place behind a flag so that case folding writes back to the read buffer
19:52
<hsivonen>
this way I could avoid name copying whenever a name doesn't cross the read buffer boundary
19:54
<hsivonen>
the fun part would be that then one could make a legitimate claim that lower case is faster :-)
20:09
zcorpan_
likes <xmp> and wonders why it was deprecated way back when
20:10
<zcorpan_>
even pretending html to be sgml it's just an ordinary rcdata element, isn't it?
20:10
<Philip`>
Maybe because they couldn't work out a good way to show authors an example of how to write <xmp>...</xmp>, without the example closing itself half-way through?
20:10
<hsivonen>
zcorpan_: there's a subject for a fun public-html thread
20:10
<Philip`>
Opera's <xmp> parsing is very broken, unfortunately :-(
20:10
<Philip`>
I think it was actually mentioned on public-html some months ago
20:10
<zcorpan_>
Philip`: &lt;/xmp>
20:11
<Philip`>
zcorpan_: That won't work, except in Opera
20:11
<Philip`>
since it ought to just show the text "&lt;/xmp>"
20:11
<zcorpan_>
oh, it's a cdata element even
20:14
<Philip`>
http://software.hixie.ch/utilities/js/live-dom-viewer/?a%3Cscript%3E%3C/script/%3Ea%0Aa%3Cstyle%3E%3C/style/%3Ea%0Aa%3Cxmp%3E%3C/xmp/%3Ea - that's rather odd in Firefox
20:17
<zcorpan_>
Philip`: file a bug? :)
20:19
<Philip`>
No need - it'll all be perfect once they've just implemented HTML5 ;-)
20:26
<zcorpan_>
http://software.hixie.ch/utilities/js/live-dom-viewer/?%3Ca/x%3E%3Ca/x/%3E%3Ca/x/x%3E%3Ca/x%20%3E%3Ca/x%20x%3E
20:35
<zcorpan_>
<xmp><!--</xmp>--></xmp> ;)
21:50
<Philip`>
Looks like my OCaml implementation isn't very good - the C++ one is 150 times faster...
22:00
<Philip`>
Oh, right, that's because I'm making it read from stdin one character at a time
22:10
<Philip`>
Aha - the OCaml one is now only four times slower than the C++ one, for tokenising the HTML5 spec
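
[Editor's sketch] The fix Philip` describes is reading input in blocks rather than one character per read call. His OCaml code is not shown here, so the following is only a Python illustration of the same idea; the generator `iter_chars` and its `block_size` default are hypothetical. The tokenizer still sees one character at a time, but the underlying stream is asked for data far less often.

```python
import io


def iter_chars(stream, block_size=65536):
    """Yield characters one by one while reading the stream in large blocks."""
    while True:
        block = stream.read(block_size)
        if not block:        # empty string means end of input
            return
        yield from block


# Any text stream works; sys.stdin could be passed in the same way.
for ch in iter_chars(io.StringIO("<p>hi"), block_size=2):
    print(repr(ch))
```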
23:25
<Philip`>
http://canvex.lazyilluminati.com/svn/tokeniser/ is the current version of [not quite all of] my code