00:00
<zcorpan_>
Hixie: feels great :)
00:00
<zcorpan_>
jgraham: ok. thanks
00:01
<Hixie>
hey i guess working for opera also means you get w3c member access
00:01
<zcorpan_>
yeah
00:01
<Hixie>
now you can see the craziness you've previously only been able to imagine
00:02
<jgraham>
zcorpan_: I think you need to join the html5lib-discuss group to post to it btw.
00:02
<Philip`>
Are you being paid to work on this at 1am? :-)
00:02
<zcorpan_>
Philip`: yep :)
00:02
<zcorpan_>
Philip`: plus, i work from home
00:02
<zcorpan_>
my work day starts when i want and ends when i want
00:03
<Dashiva>
h4x
00:03
<zcorpan_>
which is usually when i wake up and when i go to bed, respectively
00:03
<Dashiva>
We have core time in Oslo
00:05
<zcorpan_>
Hixie: i read the pointers in http://ln.hixie.ch/?start=1172653243&count=1 but i haven't looked at other crazyness
00:05
<Hixie>
btw i'm going to be in oslo (though extremely tired) late next monday and early next tuesday
00:05
<Hixie>
i'll probably pop by the opera offices
00:06
zcorpan_
wonders if anyone will pop by the eskilstuna office
00:07
<Dashiva>
Just as I take two days off. I'm going to miss the munchkin playing, no doubt.
00:11
<zcorpan_>
anything interesting on public-html the past 24h?
00:14
<Hixie>
i just found this interesting tidbit:
00:14
<Hixie>
Tantek Çelik (Microsoft): We are in the XHTML WG. I am the representative; recently it has become clear that the priorities of the XHTML WG are different from our priorities. We would like to see the HTML 4 and XHTML 1.x versions resolved. Most of the folks in the WG are XHTML 2 and that is not a priority for us.
00:14
<Hixie>
from http://www.w3.org/2004/04/webapps-cdf-ws/minutes-20040601.html
00:14
<Hixie>
Steven Pemberton (W3C/CWI): If you want that done, you have to do it.
00:17
<tantek>
Thanks for the memory Hixie :)
00:17
<tantek>
yes, that workshop is where everything "blew up" as the kids say
00:17
<Hixie>
indeed
00:18
<Hixie>
but i didn't realise that steven actually told us to go do html5
00:18
<tantek>
he didn't
00:18
<tantek>
he told you to go do html5, and me to go do microformats
00:18
<tantek>
he just didn't realize he did ;)
00:18
<tantek>
and yes, you're welcome for the setup :)
00:19
<Hixie>
:-)
00:20
<tantek>
out of that workshop i was more convinced than ever that I had to leave microsoft and pursue microformats wherever there was support for them, knowing that you would have a pretty good handle on the HTML 4.x XHTML 1.x updates.
00:24
<tantek>
Hixie, it wouldn't be inaccurate for you to even state that Microsoft's representative to that workshop called for work on HTML4 and XHTML1 along a set of requirements remarkably similar to those adopted by WHATWG.
00:24
<Hixie>
indeed
00:24
<tantek>
thereby confirming all the conspiracy theorists' suspicions that WHATWG is merely doing Microsoft's bidding. ;)
00:25
<Hixie>
oh the modern conspiracy theory is that it's google's attempt at getting around the problem that converting adsense to xhtml2 would be too hard
00:25
<zcorpan_>
LOL
01:23
<webben>
Hixie: more vaguely sane long descriptions: http://www.tsu.ox.ac.uk/info/report.php
01:24
<webben>
(although I think they could have made use of data tables)
01:25
<webben>
another example: http://docs.sun.com/source/817-5763/
01:26
<webben>
in general, look through this search: http://www.google.co.uk/search?hl=en&q=%22long+description+for%22 for lots of longdesc examples
01:28
<Hixie>
my script uses the same source data as that search, basically
01:38
Philip`
never knew that IE supports <comment>...</comment>
01:39
<Philip`>
(Interestingly the text appears to be not in the DOM, but is in the innerHTML view)
03:09
<Hixie>
heh, i just noticed something about the press release the w3c put out when the charters were announced
03:10
<othermaciej>
yeah?
03:10
<Hixie>
it says:
03:10
<Hixie>
"With the chartering of the XHTML 2 Working Group, W3C will continue its technical work on the language at the same time it considers rebranding the technology to clarify its independence and value in the marketplace."
03:11
<othermaciej>
hah!
03:12
<othermaciej>
"dear xhtml2 wg, how is that rebranding coming along? love, the html wg"
05:37
<hsivonen>
annevk: I meant that when you've got a form control whose form pointer does not point to an ancestor and that doesn't have a form='' attribute pointing to the same node as the form pointer, generate an id attribute on the node pointed by the form pointer if there isn't an id already and generate a corresponding form='' attribute on the form control
05:37
<hsivonen>
annevk: this fails if the <form> element already has an id='' attribute and the value of that attribute is a duplicate
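A minimal sketch of the id-generation idea hsivonen describes above, in Python for brevity. The `Element` class, the `form_owner` field and the `generated-form-N` id scheme are all hypothetical names for illustration, not part of any real DOM or parser API. As noted in the chat, it does not rescue the case where the form already carries a duplicate id.

```python
import itertools


class Element:
    """Tiny stand-in for a DOM element (hypothetical, not a real DOM API)."""

    def __init__(self, name, parent=None, **attrs):
        self.name = name
        self.parent = parent
        self.attrs = dict(attrs)
        self.form_owner = None  # the parser's "form pointer" for form controls


def has_ancestor(node, candidate):
    """Return True if candidate is an ancestor of node."""
    node = node.parent
    while node is not None:
        if node is candidate:
            return True
        node = node.parent
    return False


_ids = itertools.count(1)


def make_form_owner_explicit(control, ids_in_use):
    """Express a non-ancestor form association with id/form attributes."""
    form = control.form_owner
    if form is None or has_ancestor(control, form):
        return  # nesting already expresses the association
    form_id = form.attrs.get("id")
    if form_id is not None and control.attrs.get("form") == form_id:
        return  # the association is already explicit
    if form_id is None:
        # Generate an id that is not used elsewhere in the document.
        form_id = f"generated-form-{next(_ids)}"
        while form_id in ids_in_use:
            form_id = f"generated-form-{next(_ids)}"
        form.attrs["id"] = form_id
        ids_in_use.add(form_id)
    control.attrs["form"] = form_id


body = Element("body")
form = Element("form", parent=body)
control = Element("input", parent=body)  # sibling of the form, not a descendant
control.form_owner = form
make_form_owner_explicit(control, ids_in_use=set())
print(form.attrs, control.attrs)
```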
05:49
<hsivonen>
othermaciej: Also I suggested the iterative DOM traversal algorithm to zcorpan, but does IE guarantee that the algorithm terminates? I think it doesn't.
05:53
<othermaciej>
hsivonen: oh - good point, I'm not sure how it works in the face of a non-tree
05:53
<othermaciej>
hsivonen: I'm not sure what exactly IE's non-tree DOMs look like
05:55
<hsivonen>
othermaciej: this is one significant reason why a non-tree DOM sucks
05:58
<othermaciej>
hsivonen: I have seen a look of shocked realization on the faces of JS library authors when they heard that IE can do that
05:59
<othermaciej>
"that explains those weird infinite loop bugs!"
05:59
<othermaciej>
do you actually know what it does though?
05:59
<othermaciej>
is it just the parent pointer that can be wrong? you could work around that with a stack
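One way to read the "work around that with a stack" suggestion, sketched in Python: walk child edges only, keep an explicit stack instead of trusting parent pointers, and remember visited nodes so the walk terminates even if the child edges form a cycle. The `Node` class and the example graph are made up for illustration and do not reproduce any actual IE DOM.

```python
class Node:
    """Minimal node with child edges only (hypothetical stand-in for a DOM node)."""

    def __init__(self, name):
        self.name = name
        self.children = []  # may contain cycles in a broken, non-tree DOM


def walk(root):
    """Depth-first walk that never follows parent pointers and visits each node once."""
    seen = set()
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue  # already visited: this is what guarantees termination
        seen.add(id(node))
        order.append(node.name)
        # Push children in reverse so they are visited in document order.
        stack.extend(reversed(node.children))
    return order


body = Node("body")
em = Node("em")
address = Node("address")
body.children = [em, address]
em.children = [address]
address.children = [em]  # a cycle between the two nodes

print(walk(body))
```

The visited set costs memory proportional to the number of nodes, which a plain firstChild/nextSibling/parentNode walk avoids on a well-formed tree.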
06:02
<Hixie>
see my blog
06:02
<Hixie>
entries starting with "Tag Soup" iirc
06:02
<Hixie>
bbl
06:09
<hsivonen>
othermaciej: not sure. The edges between EM and ADDRESS in the Mac IE 5 DOM with Hixie's case look like the ingredients of an infinite loop: http://hsivonen.iki.fi/soup-dom/ (I can't test IE6 here.)
06:14
<othermaciej>
good lord, that's insane
06:14
othermaciej
blames tantek
06:15
<othermaciej>
child pointer indicates presence in the childNodes array?
06:16
<hsivonen>
Philip`_: If you'd like to run surveys with something that runs as native instructions at run time, I suggest figuring out which Java spider framework can easily take a plugged HTML5 parser
06:17
<othermaciej>
hsivonen: it looks like traversal via firstChild/nextSibling/parentNode would not infinite loop on that, but it would miss some elements
06:17
<othermaciej>
wait, maybe it wouldn't even miss anything
06:17
<hsivonen>
Philip`_: the parser needs to get a java.io.InputStream, the value of the HTTP charset (null if absent), a SAX ErrorHandler and a SAX ContentHandler (for extracting links)
06:17
<hsivonen>
othermaciej: child is firstchild
06:18
<hsivonen>
othermaciej: IIRC
06:18
<othermaciej>
it can't be only firstChild, since you can't have multiple firstChilds
06:18
<hsivonen>
othermaciej: oh. right. can't remember anymore what I did
06:20
<othermaciej>
some nodes would be visited more than once I guess, w/ tree-based traversal
06:21
<othermaciej>
we have some ex-MacIE folks on our team, I could ask them what they were thinking :-)
06:21
<hsivonen>
Philip`_: the Internet Archive spider looks promising, but they seem to rely on the JVM running on Linux with a particular thread impl
06:22
<hsivonen>
Philip`_: btw, I wouldn't run a Java spider that used java.net.URLConnection without socket timeouts
06:22
<hsivonen>
I have more confidence in Commons HTTP Client
06:23
<hsivonen>
I haven't checked which HTTP client the Internet Archive spider uses
07:02
<Hixie>
hm, xmlns="...xhtml" usage has gone up to 20% according to the survey i just did (of several billion html docs)
07:03
<Hixie>
from about 15% about a year ago
07:07
<Hixie>
and 41% have no DOCTYPE, down from about 50% at the same time iirc
07:08
<Hixie>
19% have the XHTML1 DOCTYPE, 11% have a 4.01 Transitional DOCTYPE with no URI
07:09
<Hixie>
6% are 4.01 Transitional with URI
07:28
<Hixie>
and the 0.014% of XHTML usage has gone up to 0.062%
07:29
<hsivonen>
Hixie: real XHTML? as in a/x+x
07:30
<hsivonen>
Amazon EC2 was mentioned earlier. any actual experience with using it?
07:39
othermaciej
is surprised to hear there's that many sites that give the finger to IE; or is that conditionally served?
07:42
<Hixie>
hsivonen: yeah
07:42
<Hixie>
othermaciej: might be conditional, dunno
07:43
<hsivonen>
Hixie: does Google unify multiple representations of a page if it finds foo with Content-Location, foo.html and foo.xhtml?
07:46
<Hixie>
duplicate elimination happens before my script gets hold of the data, yes, but i don't know exactly what gets counted as a dupe
07:48
<hsivonen>
hmm. looks like Google has changed its behavior again and now http://hsivonen.iki.fi/thesis/html5-conformance-checker over .html or .xhtml. IIRC, it returned http://hsivonen.iki.fi/thesis/html5-conformance-checker.xhtml a couple of weeks ago
07:50
<hsivonen>
s/now/now prefers/
07:51
<Hixie>
it probably treats them separately and picks one based on which has the most "relevance"
09:16
<hsivonen>
http://www.w3.org/mid/886507.69879.qm⊙wmryc
09:19
<annevk>
http://lists.w3.org/Archives/Public/www-validator/2007Jul/0011.html
09:19
<zcorpan_>
oh of course. writing your own dtd makes you validate.
09:20
<annevk>
it's true
09:20
<annevk>
it's just not very smart
09:21
<zcorpan_>
might be if you really use validation as qa check, and you don't want to flag files that have 1 error you already know about and have to have around
09:56
<Lachy>
Hixie, yt?
09:59
<annevk>
zcorpan_, http://simon.html5.org/temp/html5lib-tests/dom2string.js doesn't seem to handle attributes
10:00
<zcorpan_>
annevk: oops
10:05
<zcorpan_>
annevk: fixed
10:10
<Hixie>
Lachy: yo
10:11
<Lachy>
Hey Hixie, Marcos and I are working on the XBL Primer, and we're trying to come up with a concise description of what a template is. Any suggestions?
10:12
<Hixie>
it's some markup that will be used to render the bound element, i guess
10:12
<Lachy>
so far we have "A template is used to control the presentation of a document", but we want to say something about how it reorders content in the DOM, without altering it, using shadow trees, but without using technical terms
10:12
<annevk>
interesting, Opera returns uppercase attribute names
10:13
<zcorpan_>
annevk: yeah.
10:13
<Hixie>
Lachy: good luck
10:13
<Lachy>
thanks
10:13
<Hixie>
Lachy: my best attempt is what's in the spec
10:13
<Hixie>
Lachy: in the note in the definition of <template>
10:14
<annevk>
"A template defines the building blocks for the subtree of the bounding element."
10:14
<Lachy>
yeah, that's the problem :-)
10:15
<Lachy>
hmm. we could try and work something like that into it.
10:16
<annevk>
just say something and then illustrate it with some "easy" to grasp examples
10:16
<Lachy>
yeah, that's the idea
10:19
<zcorpan_>
hm. opera can have cdata nodes in the dom. how should i output those?
10:19
<zcorpan_>
"<![CDATA[ " + current.nodeValue + " ]]>" ?
10:21
<annevk>
yeah
10:24
<zcorpan_>
done
10:30
<Hixie>
i'm instrumenting my html parser to report how many times it clones nodes in the AAA and inline-reconstruction algorithms
10:30
<Hixie>
anything else i can instrument while i'm at it?
10:31
<Hixie>
hsivonen? annevk? jgraham?
10:32
<annevk>
we have some XXX comments about tokenization...
10:33
<annevk>
specifically which cases in states are the most frequent
10:33
<annevk>
so you can optimize those cases in some way...
10:34
<annevk>
other interesting things might be <form> nodes <form> where nodes does not include </form> and then do some browser testing on those more complicated examples from real world pages
10:36
<Hixie>
eh?
10:37
<Hixie>
i could emit for each tokeniser state the most common tokens seen, i guess
10:38
<Hixie>
it would make the parser way slower, but it could work
10:38
<annevk>
it's probably not very important
10:38
<annevk>
tree mutation and node duplication are more interesting
10:39
<annevk>
would be fun to count how often you encounter <canvas> nowadays :)
10:41
<Hixie>
i've looked at elements in a separate study
10:42
<Hixie>
canvas didn't appear in the top 200
10:43
zcorpan_
suspects that some <canvas>es are only output with script
10:52
<annevk>
k
10:52
<zcorpan_>
hmm. dom core doesn't specify an order for .attributes ... i need to sort them myself
10:53
<annevk>
I wonder if we have actually sorted them...
10:55
<zcorpan_>
opera and safari don't seem to sort them. ie seems to sort them alphabetically. firefox alphabetically reversed.
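Since browsers disagree on attribute order, a serializer used for comparing test output can sort attributes by name before printing them. A small sketch in Python follows; the function name and the exact line layout are assumptions and are not taken from dom2string.js.

```python
def serialize_attributes(attrs, indent="  "):
    """Return one line per attribute, sorted by name so output is order-independent."""
    return [f'{indent}{name}="{value}"' for name, value in sorted(attrs.items())]


# The same attributes in two different insertion orders serialize identically.
print(serialize_attributes({"id": "x", "class": "y"}))
print(serialize_attributes({"class": "y", "id": "x"}))
```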
10:55
<Hixie>
ok i'm going to emit a list of total count of all the tokens
10:56
<Hixie>
for each kind of token in each insertion mode
10:56
<Hixie>
anything else?
10:56
<Hixie>
last chance before i set this off and go to bed...
10:56
<annevk>
ah, I actually meant characters I think
10:56
<annevk>
but that may be too expensive
10:56
<Hixie>
characters?
10:56
<annevk>
during tokenization
10:56
<Hixie>
how do you mean?
10:57
<zcorpan_>
see how often ">" (with quotes) appears in doctypes or bogus comments
10:57
<annevk>
so you can optimize a particular tokenization state
10:57
<Hixie>
oh i thought you wanted to optimise the tree constructor states
10:58
<Hixie>
zcorpan_: hm
10:58
<hsivonen>
Hixie: hmm. I guess there might be merit in instrumenting how often IN_BODY code runs with the actual insertion mode being one of the table modes other than caption and cell
10:58
<Hixie>
annevk: surely for the tokeniser it makes no difference since you'll just do table dispatch
10:58
<annevk>
IE has this nice <!- .... ">" more comment ... >
10:59
<zcorpan_>
Hixie: http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012078.html
10:59
<Hixie>
hsivonen: you mean an average of times per page that the inbody state is invoked when the state is not inbody, incell, or incaption?
10:59
<hsivonen>
Hixie: is it even important to clone DOM nodes instead of using the attributes on the original token and creating a new DOM node using those?
10:59
<Hixie>
zcorpan_: yeah i'm just trying to work out how to do it
10:59
<hsivonen>
that is, do you really want to clone concurrent attribute changes?
11:00
<Hixie>
i don't think the dom supports having attributes shared between nodes
11:01
<hsivonen>
Hixie: yes, the average times the table states actually fall though to in body
11:01
<hsivonen>
through
11:04
<Hixie>
ok, i'm logging the actual insertion mode when my inhead, inbody, and intable functions are invoked
11:04
<hsivonen>
Hixie: since that only happens in non-conforming cases and Java doesn't have goto, I let the code hit some useless branches when the fall-through happens
11:04
<Hixie>
hopefully they map exactly to the spec
11:06
<Hixie>
zcorpan_: for DOCTYPEs we don't care, right? since what the spec does matches IE anyway?
11:06
<hsivonen>
(A smart compiler could fix this, but I doubt javac or hotspot are that smart)
11:06
<annevk>
yeah, DOCTYPEs match IE
11:06
<annevk>
it's just that IE uses the same mode for bogus comments as they use for DOCTYPEs it seems
11:07
<Hixie>
i'm gonna bail on working out what characters are most common in each tokeniser mode, on the principle that there are so few states it hardly matters anyway
11:07
<zcorpan_>
Hixie: not quite. the spec doesn't handle <!doctype ">" >
11:07
<annevk>
oops
11:07
<zcorpan_>
Hixie: the spec only matches ie if the > is in an actual FPI or SPI
11:08
<hsivonen>
Hixie: oh yeah, one more thing for optimization: whether an average stack node is tested for being in a group of element names more than once
11:09
<Hixie>
well i didn't find any DOCTYPEs with > in their name part, at least not enough to appear on my radar in the scan of doctypes i did earlier this week
11:09
<hsivonen>
Hixie: that is, whether it makes sense to have a boolean on a stack node that says for example whether the node is a table context sentinel
11:09
<zcorpan_>
Hixie: ok
11:09
<zcorpan_>
Hixie: isn't that because > in the name part terminates the doctype? :)
11:10
<hsivonen>
Hixie: or whether a stack node should have a flag for phrasing OR formatting OR div OR address
11:10
<Hixie>
sorry, i meant "
11:10
<zcorpan_>
ah
11:10
<zcorpan_>
ok
11:10
<Hixie>
hsivonen: so what i did with that is that each well-known tag name has an integer associated with it (like an atom) and for each special feature that the parser cares about i used a bit
11:11
<Hixie>
i used 24 bits for these flags
11:12
<Hixie>
so for example all the <hx> elements have the number 0x400008400000
11:12
<hsivonen>
Hixie: my strategy is to intern well-known names so that testing against one name is a comparison of memory addresses but still testing if a name is in a group means as many comparisons as there are names in the group
11:12
<Hixie>
the leading 0x4 is "element" (as opposed to text node), the 8 is "hx node", and the 4 is "closes <p> elements"
11:13
<Hixie>
yeah so my parser never compares tag names once they're in the stack
11:13
<Hixie>
doing string compares was prohibitively expensive
11:13
<hsivonen>
interesting
11:13
<Hixie>
i just use the integer that says whether a node is a text node, comment node, doctype, etc, to say what special kind of element it is too
11:14
<Hixie>
and so everything is always exactly one & and exactly one ==
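The flag scheme Hixie outlines can be sketched roughly like this in Python. The bit positions, flag names and tag table below are invented for illustration and are not the constants from his parser; the point is only that a group test becomes one `&` and one `==` instead of a series of string comparisons.

```python
# Hypothetical flag bits; the real parser's layout is not given in the chat.
ELEMENT = 1 << 0   # node is an element (as opposed to text, comment, ...)
HEADING = 1 << 1   # node is one of the <hx> elements
CLOSES_P = 1 << 2  # start tag implicitly closes an open <p>

# Flags are assigned once, when a node is created from a token's tag name.
FLAGS = {name: ELEMENT | HEADING | CLOSES_P for name in ("h1", "h2", "h3", "h4", "h5", "h6")}
FLAGS.update({"div": ELEMENT | CLOSES_P, "span": ELEMENT})


def flags_for(tag_name):
    """Unknown names (e.g. <fiv>) are plain elements with no special behaviour."""
    return FLAGS.get(tag_name, ELEMENT)


def has(flags, mask):
    """Exactly one & and one ==, regardless of how many tag names share the flag."""
    return (flags & mask) == mask


print(has(flags_for("h2"), ELEMENT | HEADING))
print(has(flags_for("fiv"), CLOSES_P))
```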
11:15
<annevk>
and you construct those numbers during tokenization?
11:15
<hsivonen>
I guess I'll complete the tree builder with my current approach and will leave a tokenizer-assigned bitfield as a later interface-breaking optimization
11:16
<Hixie>
annevk: whenever i create a node, i create it with the appropriate constant
11:16
<Hixie>
the tokeniser doesn't know about these
11:16
<Hixie>
it emits tokens with tag names
11:16
<Hixie>
it's only when i create nodes that i use these
11:16
<hsivonen>
Hixie: ooh. so "closes p" is not assigned in the tokenizer after all
11:16
<annevk>
ok, so the tree construction stage does use string comparison?
11:17
<Hixie>
yeah, tokens are string-compared
11:17
<Hixie>
but i think my compiler might be atomising them
11:17
<Hixie>
so it's not such a big deal
11:19
<hsivonen>
I'm currently using the generic String.intern(), but I figured how to make a fast interning function with knowledge about the possible names (three-level switch: length, last char, second to last char)
11:19
<hsivonen>
but typing that is too much work
11:19
<hsivonen>
so I guess I'll write a small Python program that generates Java code for the interning function at some point
11:20
<Hixie>
zcorpan_: given that only IE does this, I'm going to assume it's not a big deal. I can investigate it in more detail later maybe. Don't want to hack the parser too much tonight. :-)
11:20
<Hixie>
beware that the names are unbounded
11:20
<Hixie>
<fiv> is an element name that is seen in the wild, e.g.
11:20
<Hixie>
you don't want to treat it as <div>
11:21
<Hixie>
especially in your case :-)
11:22
<hsivonen>
Hixie: of if the length is > 2, the prefix needs to be compared, too, to make sure
11:22
<hsivonen>
Hixie: still better than an intermediate copy to java.lang.String
11:23
<hsivonen>
Hixie: the idea is to weed out all but one prefix candidate
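A rough Python sketch of the interning idea discussed above: use length, last character and second-to-last character to narrow the known names down to at most one candidate, then compare the full name so that `<fiv>` is never mistaken for `<div>`. A dictionary lookup stands in for the generated three-level switch, and the list of known names is an arbitrary sample, not the real set.

```python
# Arbitrary sample of well-known names (assumption for the sketch).
KNOWN = ("a", "b", "p", "div", "dir", "span", "table", "tbody")


def _key(name):
    """(length, last char, second-to-last char) as a cheap discriminator."""
    return (len(name), name[-1], name[-2] if len(name) > 1 else "")


TABLE = {}
for known in KNOWN:
    # A real generator would add another switch level on collision.
    assert _key(known) not in TABLE, "key collision between known names"
    TABLE[_key(known)] = known


def intern_name(buf):
    """Return the canonical known name equal to buf, or None if buf is not known."""
    if not buf:
        return None
    candidate = TABLE.get(_key(buf))
    # The key only weeds out all but one candidate; the full compare confirms it.
    if candidate is not None and candidate == buf:
        return candidate
    return None


print(intern_name("div"))
print(intern_name("fiv"))
```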
11:23
<Hixie>
ah cool
11:28
<Hixie>
right sleep time
11:28
<Hixie>
nn
11:28
<hsivonen>
nn
12:18
<zcorpan>
the parser test format doesn't distinguish between an "" attribute and a text node "=" (e.g.: <p "">"="</p>)
12:18
<zcorpan>
| <p>
12:18
<zcorpan>
| ""=""
12:18
<zcorpan>
| ""=""
12:19
<annevk>
that's not too relevant though
12:19
<annevk>
but an interesting edge case
12:20
<zcorpan>
perhaps " in text nodes should be escaped with \?
12:20
<annevk>
why?
12:21
<zcorpan>
so you can tell the difference between attributes and text nodes. but perhaps it doesn't matter
12:22
<annevk>
just don't mix them
12:24
<annevk>
also, if you make mistakes in your parser at that level you've got bigger issues :)
12:25
<zcorpan>
which parser?
12:25
<annevk>
HTML parser?
12:25
<zcorpan>
ah. yeah.
12:34
<Philip`_>
hsivonen: I think it might be reasonable to keep the spidering and parsing completely separate, so they could be different languages (depending on what useful tools are available for), just communicating asynchronously through some database (which is probably necessary anyway to support parallelism)
12:47
<hsivonen>
Philip`_: I've never done wide-scale spidering. however, I would think that sticking stuff in a database in between would slow things significantly compared to the parser reading from the real socked when the spidering happens (possible with e.g. Commons HttpClient)
12:49
<hsivonen>
to me, it seems that the obvious way to implement this is to have a number of worker threads that run both the parser and the HTTP client and request URLs and report results to a centralized thread-safe coordination object
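A minimal Python sketch of that shape: worker threads that each fetch and parse, plus one thread-safe coordinator that hands out URLs and collects results. The class and function names are hypothetical, and `fake_fetch` replaces the real HTTP client and parser with an in-memory link graph so the example runs without a network.

```python
import threading
from collections import deque


class Coordinator:
    """Thread-safe hand-out of URLs and collection of per-page results."""

    def __init__(self, seeds):
        self._cond = threading.Condition()
        self._pending = deque(seeds)
        self._seen = set(seeds)
        self._in_flight = 0
        self.results = {}

    def next_url(self):
        with self._cond:
            # Wait while other workers may still report new links.
            while not self._pending and self._in_flight > 0:
                self._cond.wait()
            if not self._pending:
                return None  # nothing queued and nothing in flight: done
            self._in_flight += 1
            return self._pending.popleft()

    def report(self, url, result, links):
        with self._cond:
            self.results[url] = result
            for link in links:
                if link not in self._seen:
                    self._seen.add(link)
                    self._pending.append(link)
            self._in_flight -= 1
            self._cond.notify_all()


def worker(coord, fetch_and_parse):
    while True:
        url = coord.next_url()
        if url is None:
            return
        try:
            result, links = fetch_and_parse(url)
        except Exception as exc:  # always report, so waiting workers are not stuck
            result, links = repr(exc), []
        coord.report(url, result, links)


def crawl(seeds, fetch_and_parse, n_workers=4):
    coord = Coordinator(seeds)
    threads = [threading.Thread(target=worker, args=(coord, fetch_and_parse))
               for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return coord.results


FAKE_WEB = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # made-up link graph


def fake_fetch(url):
    links = FAKE_WEB.get(url, [])
    return len(links), links  # "result" here is just the number of links found


print(crawl(["a"], fake_fetch))
```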
12:49
<hsivonen>
s/socked/socket/
12:51
<hsivonen>
as for tools in different languages, if you can't make everything run on a JVM, communicating through a local socket is more efficient than having a persistence layer in between
12:51
<hsivonen>
I am assuming here that we don't want to keep copies of the spidered bytes
12:52
<Philip`_>
It would be useful to allow the thing to run on multiple computers to spread the load out, and then it would need some network communication for coordination instead of just threads
12:53
<hsivonen>
Philip`_: it might be worth investigating if instead of running a spider we should run on EC2 and read the latest Alexa spidering dump from S3
12:53
<Philip`_>
(I'm kind of thinking about multiple computers on a LAN with a fast internet connection, so the network wouldn't be a bottleneck when spreading stuff out)
12:54
<hsivonen>
I poked around the Amazon docs but I didn't find out if the Alexa dump can be easily read by URL instead of by handle obtained from Alexa search results
12:54
<Philip`_>
That sounds like a useful thing to investigate
12:55
<hsivonen>
Philip`_: anyway, you definitely want to keep the JVM up and running with multiple threads reading from sockets instead of invoking it again and again
12:55
<hsivonen>
I don't know where the other end of those sockets should be
13:00
<Philip`_>
Perhaps the hardest bit is working out which pages to look at so that the sample is biased sensibly - I assume normal spiders just try to grab as much stuff as possible, which is not useful since they'll spend far too long in a few large sites
13:01
<hsivonen>
yeah, I think in principle we want to look at the Web breadth first, but not just front pages
13:01
<Philip`_>
and I would expect it's not possible to grab a large enough sample to do something like PageRank to find the interesting pages
13:05
<Philip`_>
(though maybe it wouldn't be too rubbish to just use the process which the original PageRank is modelling, where you follow random links and have a ~15% chance of getting bored and jumping to some other arbitrary page)
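The random-surfer process Philip` mentions could be sketched like this in Python, over an in-memory link graph. The function name, the graph and the choice to also jump from dead ends are assumptions for the sketch; a real sampler would fetch pages and would need some source of "arbitrary" pages to jump to.

```python
import random


def random_surf(links, start, steps, jump_probability=0.15, rng=None):
    """Follow random links, jumping to an arbitrary known page now and then."""
    rng = rng or random.Random()
    pages = sorted(links)  # pages we can jump to (assumed known up front)
    current = start
    visited = []
    for _ in range(steps):
        visited.append(current)
        outgoing = links.get(current, [])
        if not outgoing or rng.random() < jump_probability:
            current = rng.choice(pages)      # bored, or a dead end: jump
        else:
            current = rng.choice(outgoing)   # follow a random link
    return visited


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}  # made-up graph
sample = random_surf(graph, "a", 50, rng=random.Random(0))
print(len(sample), sorted(set(sample)))
```

Pages that are visited more often in such a walk are the ones the sample should favour, which is the bias PageRank itself models.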
13:07
<hsivonen>
cool. the IA crawler uses Commons HttpClient
13:18
<hsivonen>
Philip`_: I encourage you to take a look at http://crawler.archive.org/
18:27
<annevk>
http://html5.org/parsing-tests/testrunner.htm
18:30
<annevk>
lots of browser backing for ignoring </head>
18:31
<annevk>
but I guess that was already known
18:32
<annevk>
I suppose next would be some prefs so you can ignore IE <title> insertions
19:21
<jgraham>
annevk: re: running python on my web server; the short answer is that I can't (that was in response to your message a few days ago)
19:43
<annevk>
jgraham, are you a registered user?
19:43
<annevk>
Philip`, zcorpan, you can now filter with http://html5.org/parsing-tests/testrunner.htm as well for IE specific quirks
19:46
annevk
wonders what tantek will do next
19:54
<annevk>
Setting the flag makes a lot more pass in IE and Opera. Mostly because IE messes up both DOCTYPE and inserts <title> and because Opera does not include DOCTYPE at all
19:55
<annevk>
It also helps some for Firefox which always uppercases the tag name in the DOCTYPE
19:56
<jgraham>
annevk: Of freenode? No
20:11
<zcorpan>
annevk: nice!
20:17
<annevk>
I fixed some further bugs and I'm going home now
20:18
<annevk>
I'll commit it tomorrow to one of the open source thingies we have
20:18
<zcorpan>
ok
20:18
<annevk>
now someone can write python scripts to iterate over those numbers browsers return...
20:28
<Hixie>
of the 50 or so sites I found with cycles in the headers="", all but three are government sites
20:42
<mpt>
How does that compare with the proportion of government sites without cycles in the headers?
20:42
<mpt>
(Not that I'm interested, it's just the basic "compared to what?" question)
20:54
<Hixie>
mpt: the fact that it's 50 basically means it's an insignificant number that have cycles
20:58
<mpt>
ok
20:59
<Hixie>
http://sixstar.cca.gov.tw/community/pages/01_about_people.php?CommID=1231&ID=1
20:59
<Hixie>
it's so hard to argue that that is a valid use of headers=""
20:59
<Hixie>
sigh
21:00
<Hixie>
with my proposed heuristic for the top left cell, if they changed that into an actual table it would actually work fine with implied scope=s
21:03
<hsivonen>
Hixie: btw, shouldn't scope be down, up, right, left (not row/column)
21:04
<hsivonen>
Hixie: if you have two rows of headers where the upper row applies to the lower row but not vice versa, shouldn't scope be down instead of column?
21:06
<hsivonen>
An end tag whose tag name is one of: "p", "br" is weird to have in "in head noscript"
21:09
<zcorpan_>
hsivonen: why?
21:10
<Hixie>
hsivonen: the values come from html4
21:10
<hsivonen>
zcorpan_: other stray end tags get ignored
21:10
<hsivonen>
Hixie: I know that explicit ones come from there but implicit ones don't have to
21:10
<zcorpan_>
hsivonen: not </p> or </br>
21:11
<hsivonen>
zcorpan_: yeah. like I said, weird
21:11
<Hixie>
hsivonen: there's only one implicit one, "auto", and it has no keyword
21:11
<zcorpan_>
hsivonen: not specific to in noscript in head though
21:14
<Hixie>
wow, some (very few) of the pages caused the AAA algorithm to create over 1000 clones for one stray end tag
21:16
<hsivonen>
Hixie: I hope that doesn't count as a reason to redesign the algorithm
21:16
<Hixie>
no, it's expected really
21:16
<hsivonen>
Hixie: what Safari does on those pages? what about Firefox or Opera?
21:16
<Hixie>
no idea, dunno which pages it is
21:17
<Hixie>
355 billion invocations of the AAA algorithm resulted in zero clones
21:18
<Hixie>
715 thousand invocations resulted in one clone
21:18
<Hixie>
er sorry
21:18
<Hixie>
715 million
21:18
<Hixie>
55 million resulted in 2 clones
21:18
<Hixie>
10 million, 3 clones
21:18
<Hixie>
3 million, 4 clones
21:19
<Hixie>
800 thousand, 5 clones
21:19
<Hixie>
460000 6 clones
21:19
<gsnedders>
Hixie: 1 billion == 1 million million or 1 thousand million?
21:19
<Hixie>
237000 7 clones
21:19
<Hixie>
US billion, thousand million, 1e9
21:20
<Hixie>
less than 100,000 instances of the AAA algorithm resulted in 11 clones
21:20
<Hixie>
i guess i should have gotten the total count
21:20
<hsivonen>
Hixie: cool. are you going to post this to public-html?
21:20
<Hixie>
to make this a useful number
21:20
<Hixie>
in due course
21:21
Philip`
finds that writing the HTML5 tokeniser as an OCaml data structure and then printing C++ from it is perhaps slightly crazy, but doesn't seem entirely infeasible (though I've only got about a quarter of two states implemented so far...)
21:22
<Hixie>
wait this can't be right, according to separate data, there were only 900,000,000 invocations of the AAA
21:22
<Hixie>
oh, wrong number
21:22
<Hixie>
phew
21:35
<hsivonen>
Hixie: I forgot to ask you this when you asked about instrumentation but did you record data on stack depth?
21:36
<Hixie>
yeah but it's biased because my parser bails after 64k elements
21:37
<hsivonen>
Hixie: what did you find?
21:37
<Hixie>
http://freechal.com/banilaB8 was one of the worst pages
21:37
<Hixie>
(that my parser didn't bail on)
21:37
<hsivonen>
Hixie: so you use a hard limit as well ;-)
21:38
<Hixie>
well i run out of bits to store the pointer in after 64k
21:38
<hsivonen>
the pointer?
21:38
<Hixie>
i have 64 bits to store the length of the text node, the offset of the text node, the pointer to the parent element, and some bits for e.g. if it's a comment node or a text node
21:39
<Hixie>
and the bit that points to the parent element has to also sit alongside the 24 bits i use for the element flags
21:39
<Hixie>
anyway
21:40
<Hixie>
the 50th percentile of the pages my parser didn't bail on had 16 or fewer nodes in its stack at the biggest point
21:40
<Hixie>
99th percentile had 40 or less
21:40
<Hixie>
100th percentile had 64k
21:40
<hsivonen>
Hixie: thanks
21:40
<Hixie>
i can get you more later but i really have to go shower
21:41
hsivonen
does new StackNode[64]
21:41
<Hixie>
heh
21:53
<Hixie>
incidentally, the reason i used 64k as my limit is that i'm having to balance the number of text nodes with the number of elements
21:53
<Hixie>
right now my text nodes are 32k max each
21:53
<Hixie>
i could make them 16k each but have 128k elements, but it turns out that, anecdotally, to process any significantly greater number of pages, i'd have to add many many bits
21:53
<Hixie>
like 4, or 5
21:54
<Hixie>
whereas there are many pages with more than 32k characters at once
21:54
<Hixie>
i suspect that the pathological cases with deep stacks are all cases of bad interactions with AAA
21:57
Philip`
wonders why Opera says "XML parsing failed" when loading http://html5.org/parsing-tests/data/tests3.dat
21:58
<Philip`>
Oh, how odd, it works when I reload...
22:01
<zcorpan_>
Philip`: because it thinks anything loaded through XHR is XML
22:01
<zcorpan_>
Philip`: and then remembers that
22:01
<Hixie>
bbl
22:03
<Philip`>
zcorpan_: Ah, that seems to make as much sense as could be expected
22:08
<hsivonen>
do these statements have a significant difference "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until an element with that tag name has been popped from the stack." and "If the stack of open elements has an element in scope with the same tag name as that of the token, then pop elements from this stack until the stack no longer has an element with the same tag nam
22:09
<Hixie>
yes
22:09
<hsivonen>
ok
22:09
<Hixie>
it differs if the stack has two elements of that name in it
22:09
<Hixie>
e.g.
22:09
<Hixie>
<div><div>
22:09
<Hixie>
however typically the second wording is only used for elements that can't be twice on the stack
22:09
<Hixie>
in which case it doesn't matter
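The difference between the two wordings can be seen in a small Python sketch. For simplicity it treats "in scope" as "anywhere on the stack", which is an assumption; the spec's scope check stops at certain elements.

```python
def pop_until_popped(stack, name):
    """First wording: pop until one element with that name has been popped."""
    stack = list(stack)
    while stack:
        if stack.pop() == name:
            break
    return stack


def pop_until_none_left(stack, name):
    """Second wording: pop until no element with that name remains on the stack."""
    stack = list(stack)
    while name in stack:
        stack.pop()
    return stack


nested = ["html", "body", "div", "div"]
print(pop_until_popped(nested, "div"))     # only the inner div is closed
print(pop_until_none_left(nested, "div"))  # both divs are closed

single = ["html", "body", "p", "b"]
print(pop_until_popped(single, "p") == pop_until_none_left(single, "p"))
```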
22:10
<hsivonen>
Hixie: how do you get two nested <p> elements in scope?
22:10
<Hixie>
i don't think you can
22:11
<hsivonen>
Hixie: ok. thanks. I'll send email. Every time you use a different wording for no good reason, I have to stop and think. :-)
22:12
<Hixie>
thinking is good! :-)
22:13
<Hixie>
bbl