04:10
<MikeSmith>
http://www.w3.org/2008/12/17-w3m-irc.html
04:12
<MikeSmith>
oops
04:12
<MikeSmith>
sorry
07:13
<MikeSmith>
Philip`: can you remind me where source for your spec splitter is?
07:13
<MikeSmith>
and what structure does it expect in the source doc it's meant to split?
08:46
<Philip`>
MikeSmith: http://code.google.com/p/html5/source/browse/trunk/spec-splitter/spec-splitter.py
08:47
<Philip`>
It expects the source document to be the HTML5 spec
08:49
<Philip`>
thought it may not be too hard to modify for other documents, mostly by modifying the split_exceptions variable (which defines some ids to split on, in addition to the default of all <h2>)
08:49
<Philip`>
*though
09:02
jgraham
apologises for breaking the meta tag handling in html5lib
09:03
<jgraham>
We need to implement the character encoding changing stuff
09:03
<jgraham>
Also maybe it would be a performance win to intern tag names and then compare with is rather than == in the tree builder?
09:04
<takkaria>
a massive perf win, I'd have thought
09:05
<jgraham>
takkaria: I guess that depends on how slow interning strings is and whether Philip` implements it or me :)
09:06
<Philip`>
The entire tree builder has very little cost compared to the tokeniser, as far as I can tell
09:07
<takkaria>
Hubbub doesn't intern tag names yet, and some 20% of runtime is spent in string comparisons on tag names ATM, IIRC
09:08
<Philip`>
Actually, I suppose "very little" might really mean "about a third of total runtime", which is still significant
09:08
<Philip`>
String comparison in Python is implemented in C, so it's relatively fast - I think it's all the Pythonic bits that are taking all the time
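A minimal sketch of the interning idea jgraham raises above, assuming tag names are interned once as they leave the tokeniser. The names `TAG_DIV`, `intern_tag_name` and `is_div` are hypothetical and are not part of html5lib. Whether this is actually faster than `==` would need measuring, as Philip` points out.

```python
import sys

# Hypothetical constant: the canonical interned copy of one tag name.
TAG_DIV = sys.intern("div")

def intern_tag_name(raw_name):
    """Return the interned, lower-cased form of a tokenised tag name."""
    return sys.intern(raw_name.lower())

def is_div(tag_name):
    # Identity comparison; only safe because both sides were interned.
    return tag_name is TAG_DIV

# The name is built at run time here, much as a tokeniser would build it.
name = intern_tag_name("DIV")
print(is_div(name))
```

The identity check is only correct if every tag name that reaches the tree builder has passed through the interning step; a single un-interned string would silently fail the comparison.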
09:34
jgraham
is pretty sure you can't solve a theorem
09:38
<Philip`>
Is it not reasonable to say that e.g. Fermat's Last Theorem has been solved?
10:00
<jgraham>
Philip`: I guess that is common although I think it is inaccurate (should be "Fermat's Last Theorem has been proven" or something). However the example in the spec that talks about "Solving Pythagoras' theorem" to mean "solving for the third side of a triangle given the other two sides" seems more wrong
10:02
Philip`
supposes it should have been called Fermat's Last Hypothesis
10:06
<Philip`>
jgraham: The spec says it's 'solving for some variable', which seems a different meaning to simply 'solving'
10:12
<Philip`>
(It seems probably clear enough that it means it's taking an equational form of Pythagoras' Theorem and solving that equation in terms of some variable, which is what it's doing)
11:29
<jgraham>
Philip`: I could tell what it was talking about but I still don't think that a theorem is something you can solve.
13:32
<yecril71>
A popup window can accommodate to page size.
13:40
<Philip`>
html5lib r1233 takes 16.1 seconds to parse the HTML5 spec
13:40
<Philip`>
html5lib r1241 takes 13.1 seconds to parse the HTML5 spec
13:40
<Philip`>
I suppose that's not an entirely worthless improvement
13:42
<MikeSmith>
Philip`: !
13:42
<MikeSmith>
your spec splitter, can you remind me where the source is?
13:43
<Philip`>
MikeSmith: http://krijnhoetmer.nl/irc-logs/whatwg/20081218#l-244
13:44
<MikeSmith>
ah, thanks
13:45
<MikeSmith>
dunno why I didn't see that before
13:46
<MikeSmith>
OK, so it splits on all H2s? (plus whatever IDs are in split_exceptions)
13:46
<Philip`>
Yes (except in things with class="no-toc")
13:46
<Philip`>
Uh, I mean:
13:46
<Philip`>
Yes (except for H2s with class="no-toc")
13:47
<Philip`>
No, I don't mean that
13:47
<Philip`>
Yes (except for anything before the first H2 without class="no-toc")
13:48
<Philip`>
Hmm, html5lib parses in 12.2 seconds if I make it use a string for input (instead of a stream)
13:50
<jgraham>
?!
13:51
<jgraham>
That seems odd. I thought we had to work to make strings look like file-like objects
13:51
<Dashiva>
stringio?
13:52
<jgraham>
iirc
13:52
<Philip`>
jgraham: Uh, I mean: ...if I make it use a string for input (instead of a stream) and change HTMLInputStream to behave more efficiently when it's got a string as input
13:53
<Philip`>
(There's no point messing around with chunks and unget buffers if you've already got the entire string stored in memory)
13:53
<jgraham>
Well that sounds more sensible ;)
13:54
<Philip`>
If I also disable the position-recording code, it parses in 11.4 seconds
13:55
<Philip`>
Might it be reasonable to add a constructor argument to disable position computation, which people can use if they're sure they're never going to care about positions (e.g. they're not going to look at parse error messages)?
13:55
<jgraham>
Philip`: Can you put a profile somewhere?
13:57
<Dashiva>
Philip`: How much do you unget in the worst cases? Enough that a double buffer for the file IO wouldn't be practical?
13:58
jgraham
wonders what a double buffer is
14:00
<Philip`>
jgraham: Something like http://philip.html5.org/misc/html5lib-profile-r1241.txt ?
14:00
<Philip`>
Dashiva: One character
14:01
<Philip`>
jgraham: (That's from running parse.py --no-html -t)
14:02
<Philip`>
jgraham: (with the input being the HTML5 spec from 25 July 2008)
14:02
<Dashiva>
jgraham: (in this case) keeping two buffers so you can read into one and still have the other available for use. That way you always have old data available for ungetting
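A rough sketch of the double buffer Dashiva describes, assuming a file-like object read in fixed-size chunks. The class `DoubleBufferedReader` and its methods are hypothetical and only illustrate the idea; this is not how html5lib's HTMLInputStream is written.

```python
import io

class DoubleBufferedReader:
    """Hypothetical reader that keeps the previous chunk for ungetting."""

    def __init__(self, stream, chunk_size=4096):
        self.stream = stream
        self.chunk_size = chunk_size
        self.buffers = ["", ""]  # [older chunk, newer chunk]
        self.index = 1           # which buffer is being read
        self.offset = 0          # position inside that buffer

    def char(self):
        if self.offset >= len(self.buffers[self.index]):
            if self.index == 0:
                # Finished re-reading the older chunk; resume in the newer one.
                self.index, self.offset = 1, 0
            else:
                chunk = self.stream.read(self.chunk_size)
                if not chunk:
                    return None  # end of input
                # The newer chunk becomes the older one; old data stays available.
                self.buffers = [self.buffers[1], chunk]
                self.offset = 0
        c = self.buffers[self.index][self.offset]
        self.offset += 1
        return c

    def unget(self):
        """Step back one character, crossing into the older chunk if needed."""
        if self.offset > 0:
            self.offset -= 1
        elif self.index == 1 and self.buffers[0]:
            self.index = 0
            self.offset = len(self.buffers[0]) - 1
        else:
            raise ValueError("cannot unget past the older buffer")

reader = DoubleBufferedReader(io.StringIO("abcdef"), chunk_size=3)
first_four = [reader.char() for _ in range(4)]
```

With a worst case of one ungot character, as Philip` reports, a single spare chunk is always enough history.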
14:04
Philip`
notes that the current tokeniser updatePosition function is pretty much entirely unrelated to the function that was called updatePosition in the last-but-one revision of html5lib
14:05
<Philip`>
(Now it's the thing that counts newlines in the string returned from charsUntil, to update the position information)
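A small sketch of position tracking by counting newlines in each consumed run of characters, with an opt-out flag along the lines of the constructor argument Philip` suggests. The class `PositionTracker` and the `enabled` parameter are hypothetical; html5lib's actual updatePosition may differ.

```python
class PositionTracker:
    """Hypothetical line/column tracker fed with each consumed string."""

    def __init__(self, enabled=True):
        self.enabled = enabled  # False skips all position bookkeeping
        self.line = 1
        self.col = 0

    def update(self, chars):
        if not self.enabled:
            return
        newlines = chars.count("\n")
        if newlines:
            self.line += newlines
            # Column restarts after the last newline in this run.
            self.col = len(chars) - chars.rfind("\n") - 1
        else:
            self.col += len(chars)

tracker = PositionTracker()
tracker.update("ab\ncd")
tracker.update("ef")
```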
14:05
<jgraham>
Philip`: Like that but sorted by total time rather than cumulative time, I guess
14:07
<Philip`>
jgraham: http://philip.html5.org/misc/html5lib-profile-r1241-2.txt ?
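For reference, a sketch of producing both orderings jgraham mentions with the standard `cProfile` and `pstats` modules. The `tokenise` function is a hypothetical stand-in workload; the real profiles were taken from html5lib's parse.py, whose setup is not reproduced here.

```python
import cProfile
import io
import pstats

def tokenise(text):
    # Stand-in workload; the real target would be the html5lib parse call.
    return [word for word in text.split()]

def profile_report(sort_key):
    """Profile the workload and return the top entries sorted by sort_key."""
    profiler = cProfile.Profile()
    profiler.enable()
    tokenise("<p>some text</p> " * 1000)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats(sort_key).print_stats(5)
    return out.getvalue()

by_total = profile_report("tottime")          # time spent inside each function
by_cumulative = profile_report("cumulative")  # includes time in callees
```

Sorting by `tottime` points at the functions that are themselves slow, while `cumulative` tends to float high-level entry points to the top.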
14:13
<Philip`>
(There is sadly a lack of blatant bottlenecks :-( )
14:21
<zcorpan>
who's up for a naming debate? http://www.w3.org/Bugs/Public/show_bug.cgi?id=6298
14:22
<Dashiva>
I'm all out of bikes to shed
14:24
<jgraham>
zcorpan: HTML5 (All Your Error Are Belong To Us) should cover it I think
14:24
<jgraham>
Philip`: Indeed. Maybe we could just get Hixie to remove stuff from the spec so we have fewer tokens to process
14:25
<zcorpan>
or make everything valid so you don't need to spend time reporting errors
14:26
<Philip`>
zcorpan: It spends almost no time reporting errors in my profiling, since the input only has one error (missing doctype)
14:26
<zcorpan>
oh. ok
14:27
<Philip`>
(It's intentional that I'm profiling the parsing of the spec, rather than of more realistic web content, because my use case for html5lib is parsing the HTML5 spec and so that's what I want to optimise for)
14:27
<Philip`>
(If someone else wants to use html5lib for parsing loads of invalid content, they can optimise those parts themselves :-) )
14:29
<Philip`>
jgraham: He's already added 25% more bytes to the spec than the version I've been using :-(
14:31
<jgraham>
Philip`: Well that's no good at all.
14:31
<Philip`>
(Actually I've mostly been testing with the first 10^4 lines of the spec, because I'm too lazy to wait for the entire thing to be parsed every time I make a change)
14:50
<yecril71>
I do not think authors think about tags at all, they just want to publish information.
14:52
<yecril71>
18 comparisons would be enough for 2^17 insertion modes.
14:53
<Philip`>
"if an end user finds an error, he probably will report it to the owner of the web site, who in turn will report it (quite angrily) to web designer." - I don't think that's true - if I find an error on a web site, I just moan about it on IRC
14:54
<yecril71>
I sometimes do when the site steps on my toes too heavily.
14:54
<yecril71>
But I am not an ordinary end user anyway.
14:55
<Philip`>
If it's an ill-formed XML error, I'd find it hard to report to the owner even if I wanted to, because it would prevent me from reading their site and finding the contact details
14:55
<yecril71>
You can look for @ in the page source.
14:56
<yecril71>
(if it is not disguised as something else, that is)
14:56
<yecril71>
Frex, I reported the CSS problem with the spec to Ian, and he boldly ignored my complaint.
14:57
<Philip`>
I'm far too lazy to do that
14:57
<yecril71>
I have had more success with less knowledgeable webmasters though :-)
14:58
<yecril71>
Especially if they boast the W3C badge.
15:00
<yecril71>
Is cross-domain xsl:import guaranteed to throw?
15:00
Philip`
remembers (a very long time ago) using imdb.com as an example of why the <image> tag has to be supported, and someone saying that they were going to notify the IMDB web people about the problem as a demonstration that it's possible to clean up that kind of legacy mistake
15:00
<Philip`>
...and imdb.com still uses <image> today :-(
15:01
<yecril71>
Ian belongs to the minority who reply there is a problem with my browser.
15:02
<yecril71>
I have just got a similar response from Apple.
15:02
<yecril71>
But Ian was at least right, while Apple is wrong.
15:06
<yecril71>
html5lib is an implementation of an HTML5 parser.
15:06
Philip`
is aware of that :-p
15:07
<yecril71>
<div><p>some text<p>some more text</p></p></div> is correct but it amounts to three paragraphs instead of two.
15:07
<Dashiva>
An unmatched </p> is correct?
15:08
<yecril71>
Isn't it?
15:09
<yecril71>
The opening tag is optional so it can be inserted when needed.
15:10
<yecril71>
I do not think optional tags are a result of guessing.
15:11
<Dashiva>
If the stack of open elements does not have an element in scope with the same tag name as that of the token, then this is a parse error; act as if a start tag with the tag name p had been seen, then reprocess the current token.
15:11
<yecril71>
I think they are rather straightforward to invent.
15:11
<yecril71>
And what about <div >&nbsp;</p >?
15:12
<Dashiva>
That's <div>[text:&nbsp;]<p></p> as far as I can see
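A simplified sketch of the rule Dashiva quotes, assuming the stack of open elements is a plain list of tag names. The function `handle_end_tag_p` is hypothetical, and the scope check is reduced to "is there a p anywhere on the stack", which ignores the scoping markers the spec defines.

```python
def handle_end_tag_p(open_elements, errors):
    """Process </p>; return True if an implied <p> had to be created."""
    implied = False
    if "p" not in open_elements:  # simplification of "has an element in scope"
        errors.append("unexpected </p>: no p element in scope")
        open_elements.append("p")  # act as if a <p> start tag had been seen
        implied = True
    # Reprocess the end tag: pop elements up to and including the p.
    while open_elements.pop() != "p":
        pass
    return implied

stack = ["html", "body", "div"]
errors = []
created = handle_end_tag_p(stack, errors)
```

For `<div>&nbsp;</p>` this yields an empty p element inside the div plus one parse error, matching the `<div>[text:&nbsp;]<p></p>` reading above.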
15:13
<yecril71>
That would be a change from HTML4 IMHO.
15:13
<zcorpan>
yecril71: <p> is not optional
15:14
<Dashiva>
Yeah, it was mandatory in html4 too
15:15
yecril71
is ashamed for talking rubbish
15:18
<yecril71>
You cannot check every input after publishing; there are exponentially many ways to arrange things.
15:19
<yecril71>
Giovanni should be given the title Master of Colourful Narration.
15:20
Philip`
discovers that his view of the world has been entirely wrong
15:20
<Dashiva>
What was it this time, Philip`?
15:20
<Philip`>
I thought calling charsUntil() would be faster than repeatedly calling char(), but it turns out that it's not :-(
15:20
<Philip`>
at least for short strings
15:20
<Philip`>
like tag names
15:21
<jgraham>
Oh. That sucks. Can't we make charsUntil faster
15:21
<jgraham>
?
15:21
<Philip`>
If I'm not doing something stupid, removing the charsUntil from tagNameState makes parsing go ~3% faster
15:22
<Philip`>
jgraham: It seems there's a tradeoff in charsUntil between being efficient for short strings and being efficient for long strings
15:22
<yecril71>
I actually need HTML.
15:23
<Philip`>
Currently it constructs a regex for each set of characters, and then matches that against the input document
15:23
<yecril71>
Typing </td ><td > and </li ><li > makes me sick.
15:24
<Philip`>
which I suppose is adding some fixed overhead that doesn't pay off when you're typically going to be finding two characters
15:24
<Philip`>
(though it's very helpful in dataState, where you might be finding hundreds of characters before the next '<' or '&')
15:24
<jgraham>
Philip`: How about a non-regexp version of charsUntil that just has a while loop or something?
15:25
<Philip`>
jgraham: How would we decide which version of charsUntil to call?
15:27
<jgraham>
Philip`: Assume that character strings are typically long and markup strings are typically short?
15:28
<Dashiva>
data-* (and maybe aria?) might be trouble
15:28
<jgraham>
Dashiva: They are rather uncommon though
15:33
<Philip`>
Looks like the cross-over point is when you're matching six characters
15:33
<Philip`>
(Below that, a loop is faster; above that, the regex is faster)
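A sketch of the two charsUntil strategies being compared, assuming the function takes the text, a start position and a set of stop characters. The names `chars_until_regex` and `chars_until_loop` are hypothetical and the signatures are not html5lib's. Timings are deliberately left out, since the cross-over point depends on the interpreter and input.

```python
import re

_pattern_cache = {}

def chars_until_regex(text, pos, stop_chars):
    """Regex variant: one cached pattern per stop set, one match per call."""
    pattern = _pattern_cache.get(stop_chars)
    if pattern is None:
        escaped = "".join(re.escape(c) for c in sorted(stop_chars))
        pattern = re.compile("[^%s]*" % escaped)
        _pattern_cache[stop_chars] = pattern
    return pattern.match(text, pos).group(0)

def chars_until_loop(text, pos, stop_chars):
    """Loop variant: no fixed overhead, but per-character Python work."""
    end = pos
    while end < len(text) and text[end] not in stop_chars:
        end += 1
    return text[pos:end]

DATA_STOPS = frozenset("<&")
TAG_NAME_STOPS = frozenset(" \t\n/>")

long_run = chars_until_regex("hundreds of characters<b>", 0, DATA_STOPS)
short_run = chars_until_loop("div class='x'>", 0, TAG_NAME_STOPS)
```

Both functions return the same string for the same arguments, so a caller could pick either one per tokeniser state; `timeit` on representative input would show where the cross-over falls.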
15:40
<zcorpan>
so use loop for tag names and attribute names, maybe unquoted attribute value...?
15:41
<Philip`>
I think I'd prefer the idea of gathering statistics rather than guessing :-)
15:42
<Philip`>
Only problem is I don't know how to do that
15:42
zcorpan
was looking at http://canvex.lazyilluminati.com/survey/2007-07-17/analyse.cgi/index
15:42
<Philip`>
Does Python provide an easy way to find your caller's function name?
15:43
<Dashiva>
When I tried, it was all hacky and ugly, looking at stack frames
15:45
<Philip`>
sys._getframe(1).f_code.co_name - that seems to work
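A sketch of how that expression could be used to gather per-caller statistics on run lengths, which is what Philip` is after. The `chars_until` function and the two state functions are hypothetical stand-ins for the tokeniser. Note that `sys._getframe` is a CPython implementation detail and may not exist elsewhere.

```python
import sys
from collections import defaultdict

# Maps caller function name -> list of run lengths it requested.
lengths_by_caller = defaultdict(list)

def chars_until(text, pos, stop_chars):
    """Hypothetical instrumented scanner that records who called it."""
    end = pos
    while end < len(text) and text[end] not in stop_chars:
        end += 1
    caller = sys._getframe(1).f_code.co_name  # name of the calling function
    lengths_by_caller[caller].append(end - pos)
    return text[pos:end]

def tagNameState(text, pos):
    return chars_until(text, pos, " />")

def dataState(text, pos):
    return chars_until(text, pos, "<&")

tagNameState("div>", 0)
dataState("some text<", 0)
```

Averaging each list afterwards gives the typical run length per tokeniser state, which is the figure needed to choose between a loop and a regex.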
15:47
jgraham
would just have parsed something and done the stats on the resulting tree
15:48
<jgraham>
(But I would like an instrumented version of html5lib that showed the call sequence and arguments for any input so that you could tell what was going wrong)
15:48
<jgraham>
In fact I sort of wrote that once for making graphs of the phase transitions but I deleted it because it was even more ugly than normal
15:49
<Philip`>
jgraham: The tree doesn't give you the right information about charsUntil
15:50
<Philip`>
e.g. it won't tell you that the whitespace bit in dataState will read an average of 3.3 characters
15:50
<Philip`>
(because in the tree it'll be merged with all the other characters)
15:53
<jgraham>
Philip`: Right, it maybe doesn't give you everything you want but it allows determinations of the length of e.g. comments, tag names, etc.
15:54
<jgraham>
You are also limited by the input data that you measure, regardless of what the method is
15:55
<zcorpan>
jgraham: Philip` only cares about the html5 spec, remember :)
15:56
<gsnedders>
We could just rewrite it to be a C extension
15:56
<gsnedders>
That'd be quick.
15:57
<Philip`>
A pure Python parser is more usable than one that relies on C
15:58
<gsnedders>
Philip`: A C one is quicker.
16:00
<Philip`>
jgraham: Lengths of comments aren't quite the right information to gather - it's the length of the strings between hyphens in comments that matters, since that's what charsUntil collects
16:31
<yecril71>
I think user agents that do not expect HTML as their primary medium should not be bound by the HTML specification at all.
16:32
<yecril71>
Like when the content type is something other than text/html.
16:33
<yecril71>
My legal theory is that section 3.3.2 applies only to a transform served as text/html to a user agent expecting HTML.
16:33
<yecril71>
XSLT processors are unaffected.
16:34
<yecril71>
Why anyone should do this remains a mystery to me.
16:35
<yecril71>
Another possibility is when elements from XHTML namespace are embedded in XML documents that are styled with CSS.
16:36
<yecril71>
In that case, behaviour may be taken from the XHTML definition.
16:36
<yecril71>
That would be true particularly for form controls and scripts, because XML does not allow scripts.
16:37
<yecril71>
(or rather, does not provide for scripts or any kind of interactivity explicitly.)
16:38
<yecril71>
But even then, XSLT has a tendency of scattering the constituents of the output page across templates,
16:39
<yecril71>
so the page as viewed from XSLT would be rather useless.
16:40
<yecril71>
Stylesheet is a legacy word to describe a transformation.
16:40
<yecril71>
It is really not appropriate any more.
16:41
<yecril71>
That does not mean that "transformation sheet" is good.
16:42
<yecril71>
It is a transformation, certainly not a sheet.
16:49
<Hixie>
jgraham: how should i phrase it then?
16:51
<Philip`>
Hmm, turns out that tagNameState is the only thing that gets any practical benefit from not using charsUntil
17:00
<jgraham>
Hixie: I would probably circumvent the problem by using a different example e.g. solving the roots of a quadratic using the quadratic formula
17:00
Philip`
gives up for now, before he introduces more pointless bugs into html5lib
17:01
<jcranmer>
ironic that I was just talking about square roots in another channel
17:01
<jgraham>
Otherwise I would say "using Pythagoras' theroem to solve for the hypotenuse <var>a</var> of a triangle of sides <var>b</var> and <var>c</var>"
17:03
<Philip`>
Wouldn't the quadratic formula take millions of lines of MathML and be impossible to read?
17:03
<Hixie>
jgraham: that works, i'll use that
17:03
<Hixie>
Philip`: yes
17:04
<Philip`>
At least it's possible for a human to read the Pythagoras one
17:04
<gsnedders>
Philip`: How many lines depends on how much whitespace you put in it :)
17:05
Philip`
wonders if text in examples like "flavours of ice cream" should be translated to en-US
17:05
<jgraham>
gsnedders: However the first law of MathML is that the total amount of markup needed to express a given concept is greater than the total amount of markup it was possible to use to express that concept
17:06
<gsnedders>
jgraham: :)
17:06
<jgraham>
s/it was/anyone thought it was/
17:06
<Philip`>
Anyway the example is clearly misleading, because you should expect 2<sup><var>n</var></sup> flavours of ice cream since they can choose whether or not to include each piped-in ingredient
17:06
<jgraham>
(the first law of Content MathML is roughly similar but with the words "much much greater")
17:16
<yecril71>
You cannot solve roots and you cannot solve a hypotenuse.
17:16
<yecril71>
You can only solve problems, including special kinds of equations.
17:17
<yecril71>
You calculate roots and the length of the hypotenuse.
17:17
<jgraham>
Er yeah, I meant to say "solve /for/ the roots"
17:17
Philip`
appears to be parsing the spec in 12.6s now
17:18
<yecril71>
solve what?
17:18
<jgraham>
for the roots :)
17:18
<jgraham>
Or, if you like, the quadratic
17:19
<yecril71>
Solve what for the roots?
17:19
<jgraham>
The quadtratic, whatever it happens to be
17:19
<jgraham>
*quadratic
17:19
<yecril71>
Quadratic equation?
17:20
<jgraham>
yes. An equation can be solved. A root is a particular point on the solution.
17:20
<jgraham>
Hence "solving for the roots" is both common and makes sense :)
17:20
<yecril71>
A root is a particular member OF the solution.
17:20
<yecril71>
The solution is a set of numbers, and calling them points is an exaggeration.
17:21
<yecril71>
"solving for the roots" misses the object.
17:21
<Philip`>
It seems equally valid to see the solution as a (discontinuous) line
17:22
<yecril71>
That is redundant information.
17:22
<yecril71>
{(x1, 0), (x2, 0)} is much better handled as {x1, x2}
17:23
<jgraham>
Well the roots are a special case you could equally say solve y=f(x) for y=5 or whatever
17:23
<jgraham>
s/case/case;/
17:24
<Philip`>
yecril71: There's no reason for a line to exist in a two-dimensional space
17:24
<Philip`>
s/exist/only exist/
17:26
<yecril71>
f(x)=5 are not roots. f(x)-5 are roots, but it is a different equation.
17:29
<jgraham>
Hixie: The algorithm for setting document.domain should probably mention the fact that setting document.domain to a tld (or maybe a domain on the public suffix list?) is a security error
17:29
jgraham
should probably send email
17:29
<Hixie>
yeah i'll be adding that. send mail. :-)
17:37
<Philip`>
Hixie: s/theroem/theorem/
18:26
<yecril71>
What does it mean that HTML is not written by hand? I am confused.
18:26
<yecril71>
My idea of writing HTML "not by hand" is to use XSLT or document.write.
18:26
<yecril71>
But the sources for these are again written by hand.
18:32
<jcranmer>
yecril71: WYSIWYG?
18:35
<Philip`>
It means whatever the person using that phrase wishes it to mean, which is usually not quite the same as what anyone else thinks it means
18:41
<gsnedders>
Lachy: replaceHeadings is now in the normal Anolis repo
18:50
<yecril71>
WYSIWYG is deplorable in case of HTML.
18:51
<yecril71>
WYSINWTRG.
18:55
<yecril71>
And the documents thus produced usually have the MOTL.
19:29
<gsnedders>
Hixie: should aside not be a sectioning root?
19:32
<Hixie>
gsnedders: ?
19:33
<gsnedders>
Hixie: surely having headers in aside elements show up in outlines doesn't make sense?
19:34
<Hixie>
why not?
19:34
<Hixie>
"surely" isn't a good argument :-)
19:34
<gsnedders>
"surely".
19:34
<gsnedders>
:)
19:34
<gsnedders>
If you look at the sidebars on most blogs, those should be an aside, right?
19:35
<gsnedders>
Surely headers in them don't make sense in an outline?
19:39
<gsnedders>
oh woops.
19:39
<gsnedders>
I just got pimpmyspec into an infinite loop :)
19:39
<gsnedders>
jgraham: Update the copy of Anolis!
19:41
Philip`
makes html5lib do a lot more memory copying every time you call unget(), which results in 2% faster parsing
20:10
<Lachy>
gsnedders, cool.
22:39
<Philip`>
I love how OpenOffice.org Calc makes it impossible for me to change a cell value from "foo" to "Foo", if there are other cells in the same column that also say "foo"
22:51
<Philip`>
http://news.bbc.co.uk/1/hi/technology/7787335.stm - using Adobe AIR to provide cross-platform DRMed video
22:52
<Philip`>
(I think I'll stick to the tool that lets me download the DRM-free MPEG4 versions that are intended for iPhone users)
23:03
<karlcow>
Philip`: http://www.bbc.co.uk/radio/help/faq/download_and_install_realplayer.shtml BBC provides a special real player. I think they had to do it because of their contract
23:04
<Philip`>
karlcow: That's only for radio, not TV
23:04
<karlcow>
though I don't know if it's still working with bbc programs. I have not used it for a very long time.
23:04
<karlcow>
Philip`: ah thanks :)
23:05
<Philip`>
The iPlayer provides nice simple DRM-free MP3 versions of radio programmes to iPhone users, so the same iPhone-masquerading tool can download them too :-)
23:05
<Philip`>
(or you can just use the streaming Flash player)
23:05
Philip`
has had very limited success with RealPlayer on Linux