| 00:24 | <rubys> | jgraham: ping? |
| 00:27 | <Hixie> | if you have what you think is a tree, in the form of a list A of mappings from one node to a list of nodes all of which are in list A |
| 00:27 | <Hixie> | is there a way short of walking the entire tree to verify that the list is indeed a tree and that there are thus no loops? |
| 00:29 | <othermaciej_> | there probably is, based on what graph properties make the graph a tree |
| 00:29 | <othermaciej_> | to be a tree you need to be not just cycle-free but also have exactly one directed edge pointing to each node (except the root) |
| 00:29 | <Hixie> | i guess i don't mean a tree, i mean a directed graph |
| 00:30 | <othermaciej_> | directed acyclic graph? |
| 00:30 | <Hixie> | right |
| 00:30 | <Hixie> | basically i have a list of table cells, each of which can be the header (through headers="") for zero or more other cells, and each of which can have zero or more header cells for itself |
| 00:30 | <Hixie> | but there mustn't be any loops |
| 00:31 | <othermaciej_> | let me look it up in my CLR |
| 00:31 | <Hixie> | i mean i'll do the full walk if there's no quicker way |
| 00:31 | <Hixie> | (memory is no object) |
| 00:31 | kingryan | thinks that's the only way |
| 00:32 | <kingryan> | you might be able to cache some of it, though |
| 00:32 | <othermaciej_> | I don't even know what you mean by "full walk" |
| 00:32 | <othermaciej_> | you'd have to walk every possible path, not just visit every node once |
| 00:32 | <othermaciej_> | if you are really brute forcing it |
| 00:32 | <Hixie> | yeah |
| 00:32 | <othermaciej_> | you'd have to show all paths through the graph terminate |
| 00:34 | <othermaciej_> | Hixie: iteratively removing nodes with no outgoing edges is one way |
| 00:35 | <Hixie> | ok screw this. i don't HAVE to check that headers="" don't form loops |
| 00:35 | <othermaciej_> | Hixie: you'd want a hashtable from node to nodes it points to, and one the other way |
| 00:35 | <Hixie> | at least not in the first pass |
| 00:36 | <kingryan> | Hixie: you only need to check them if you're going to be walking them (check to avoid inf. loops) |
| 00:36 | <Hixie> | yeah |
| 00:36 | <Philip`> | I think you could do a topological sort |
| 00:36 | <Hixie> | which i don't |
| 00:36 | <Philip`> | which'll tell you if it's got any cycles |
| 00:36 | <Hixie> | but i was hoping to be able to see how many pages had that problem |
| 00:36 | <othermaciej_> | Philip`: I'm not sure the obvious topological sort algorithms will terminate in finite time |
| 00:37 | <othermaciej_> | on a graph with cycles |
| 00:37 | <othermaciej_> | since topological sorts are designed to work on a DAG |
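The "iteratively remove nodes with no outgoing edges" idea from earlier in the log can be sketched as follows. This is only an illustration, not the code anyone in the channel wrote: the function name `hasCycleBySinkRemoval` and the input shape (a plain object mapping each node name to the list of nodes it points to) are assumptions. It keeps the two tables othermaciej_ mentions (outgoing and incoming), and it terminates on graphs with cycles because every node is removed at most once.

```javascript
// Hypothetical helper: `edges` maps each node name to the nodes it points to.
function hasCycleBySinkRemoval(edges) {
  const outDegree = new Map(); // node -> number of outgoing edges still present
  const incoming = new Map();  // node -> nodes that point to it (the reverse table)
  const touch = (n) => {
    if (!outDegree.has(n)) {
      outDegree.set(n, 0);
      incoming.set(n, []);
    }
  };
  for (const [from, targets] of Object.entries(edges)) {
    touch(from);
    for (const to of targets) {
      touch(to);
      outDegree.set(from, outDegree.get(from) + 1);
      incoming.get(to).push(from);
    }
  }
  // Start with every node that has no outgoing edges.
  const queue = [...outDegree.keys()].filter((n) => outDegree.get(n) === 0);
  let removed = 0;
  while (queue.length > 0) {
    const sink = queue.pop();
    removed += 1;
    // Removing the sink deletes one outgoing edge from each predecessor.
    for (const pred of incoming.get(sink)) {
      outDegree.set(pred, outDegree.get(pred) - 1);
      if (outDegree.get(pred) === 0) queue.push(pred);
    }
  }
  // Nodes that could never be removed lie on, or lead into, a cycle.
  return removed < outDegree.size;
}
```

If every node ends up removed, the removal order is a reverse topological order; if some remain, the graph has a cycle.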
| 00:37 | <Philip`> | You can just do a depth-first search - start with each node being white, mark each one as grey when you recurse into it, mark it as grey when you recurse back out, and if you ever follow an edge into a grey node then there's a cycle |
| 00:38 | <Philip`> | Uh |
| 00:38 | <Philip`> | *mark it as black when you recurse back out |
| 00:38 | <othermaciej_> | that works |
| 00:38 | <othermaciej_> | hmm wait |
| 00:38 | <Philip`> | (You can do some thingy with numbering nodes as you turn them black, to get a topological sort, I think) |
| 00:38 | <othermaciej_> | I'm not sure it works |
| 00:39 | <othermaciej_> | not obvious to me that a cycle couldn't be observable only by visiting a black node |
| 00:42 | <othermaciej_> | DFS can detect cycles by identifying back-edges |
| 00:43 | <othermaciej_> | your algorithm is right |
| 00:44 | <othermaciej_> | I guess that would run in O(E) where E is the number of edges |
| 00:44 | <othermaciej_> | which seems like the best you could do |
| 00:47 | <Hixie> | and it'll work whatever order i do the nodes in, as far as i can tell |
| 00:47 | <Hixie> | which is useful |
| 00:47 | <Hixie> | in my case |
| 00:50 | <Philip`> | I think I can convince myself it's right by saying that if there is a cycle, then when the DFS reaches some node N in that cycle, it will not mark the node as black until either it has reached another grey node (and found a cycle) or has searched the whole cycle and got back to N (which is grey, so it finds the cycle) or has reached a black node in the cycle; and there can never be a black node in the cycle, because the cycle will be detected before an |
| 00:50 | <Philip`> | ...before any node in the cycle is marked as black |
| 00:52 | <Philip`> | I guess you have to do something to make sure the DFS covers all the nodes (by repeatedly DFSing from some arbitrary remaining white node, until there are none) |
| 00:53 | <Hixie> | yeah i'm just going to go through every node with at least one outgoing edge (since i have to visit them anyway for unrelated reasons) and if it's white, i do the search |
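A minimal sketch of the white/grey/black depth-first search described above, with the outer loop that restarts from every still-white node. The function name `hasCycleByDfs` and the input shape (an object mapping each node to the nodes it points to) are assumptions for illustration; Hixie's actual program is not shown in the log. The recursion is kept for clarity, so a very deep graph would need an explicit stack instead.

```javascript
const WHITE = 0; // not visited yet
const GREY = 1;  // on the current recursion path
const BLACK = 2; // fully explored

// Hypothetical helper: `edges` maps each node name to the nodes it points to.
function hasCycleByDfs(edges) {
  const colour = new Map(); // a missing entry means WHITE
  const visit = (node) => {
    colour.set(node, GREY);
    for (const next of edges[node] || []) {
      const c = colour.get(next) || WHITE;
      if (c === GREY) return true; // back edge into the current path: cycle
      if (c === WHITE && visit(next)) return true;
      // BLACK: already fully explored without finding a cycle, skip it
    }
    colour.set(node, BLACK);
    return false;
  };
  // Start a search from every node that has outgoing edges and is still white.
  for (const start of Object.keys(edges)) {
    if ((colour.get(start) || WHITE) === WHITE && visit(start)) return true;
  }
  return false;
}
```

Each node is coloured grey once and each edge is examined once, and the answer does not depend on the order in which start nodes are tried.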
| 00:53 | <Philip`> | It should be O(V) rather than O(E) because it'll never visit one node more than once |
| 00:54 | <Philip`> | except I'm probably confused and it's O(E) too, so it's more like O(min(V, E)), not that anybody actually cares, since V =~ E anyway for non-crazy graphs |
| 00:55 | <Hixie> | this is where i find out there's only 5 tables on the whole web with a headers="" attribute and therefore it could be O(N^4) and still complete in finite time |
| 00:59 | <othermaciej_> | Philip`: it has to traverse every edge at least once to see the color of the node at the other end |
| 01:00 | <othermaciej> | Philip`: but I guess it's O(V+E) since you need to visit disconnected nodes too |
| 01:00 | <Philip`> | Got to be careful in case you stumble across some gigantic table with hundreds of rows and columns that's been made accessible with (buggy) headers, since that might cause an O(N^4) algorithm to take a second or two |
| 01:01 | <othermaciej> | actually I guess you don't since Hixie's data structure only represents edges |
| 01:01 | <othermaciej> | hundreds could be worse than a second or two with an O(N^4) algorithm |
| 01:01 | <othermaciej> | N^4 gets bad pretty quickly |
| 01:01 | <Philip`> | Oh, whoops, I forgot it'd still have to look along all the edges to already-black nodes |
| 01:02 | <Hixie> | yeah N^4 is insanely bad if you've got anything of any kind of size |
| 01:02 | <Philip`> | 100^4 = 10^8 which isn't all that bad if you're just following a few pointers :-) |
| 01:03 | <Hixie> | sadly i have to do a string lookup on every single one of these edges :-) |
| 01:03 | <Hixie> | (of course if it's bad, i'll optimise it more. we'll see) |
| 01:04 | <Philip`> | You could do an O(E) preprocessing step to do all the string lookups per edge, before doing the horribly inefficient but highly optimised O(N^4) cycle-finding algorithm on it :-) |
| 01:04 | <Hixie> | indeed |
| 01:08 | <othermaciej> | DFS isn't that hard to code, doesn't seem like a big deal |
| 01:08 | <Hixie> | indeed |
| 01:08 | <Hixie> | and you'll be glad to know it works |
| 01:08 | <Hixie> | sweet |
| 01:08 | <othermaciej> | nice |
| 01:09 | <Hixie> | it tested my three test tables in 0.244s including compiling the program and parsing the html |
| 01:10 | <Hixie> | and given that it took 0.245s to do the same program with only one empty test file... |
| 01:10 | <othermaciej> | it runs in negative time! |
| 01:11 | <Hixie> | and y'all were worried about it being slow! |
| 01:11 | <kingryan> | O(-N^4) ? |
| 01:15 | <Philip`> | Give it a really big table to test, and see if it returns the answer before you've even started the program |
| 01:37 | <Philip`> | Hmm, just remembered a slower but simpler way to find cycles: use a kind of negated variant of Bellman-Ford, by initialising every node's 'distance' value to 0, then setting v.distance=max(v.distance, 1+u.distance) for each edge (u,v), then repeating num_nodes+1 times, and if any has distance=num_nodes+1 then there's a cycle |
| 01:40 | <Philip`> | ...or is that totally rubbish and wrong? I'm not quite sure now |
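The Bellman-Ford-style idea can be sketched like this, with two deliberate departures from the description above that are assumptions of this sketch: it runs one pass per node, and it tests for a distance of at least the node count rather than for an exact value. In a DAG a longest path uses at most V-1 edges, so no distance can reach V; on a cycle every pass raises each cycle node's distance by at least one, and because updates are applied in place a distance can jump past any exact target. The names `hasCycleByRelaxation`, `nodes` and `edgeList` are hypothetical.

```javascript
// Hypothetical helper: `nodes` is an array of node names and
// `edgeList` is an array of [u, v] pairs, one per directed edge u -> v.
function hasCycleByRelaxation(nodes, edgeList) {
  const distance = new Map(nodes.map((n) => [n, 0]));
  const count = nodes.length;
  // One relaxation pass per node; each pass looks at every edge once.
  for (let pass = 0; pass < count; pass += 1) {
    for (const [u, v] of edgeList) {
      distance.set(v, Math.max(distance.get(v), 1 + distance.get(u)));
    }
  }
  // A longest path in a DAG has at most count - 1 edges.
  for (const d of distance.values()) {
    if (d >= count) return true;
  }
  return false;
}
```

This costs O(V·E) rather than the O(V+E) of the depth-first search, which matches the "slower but simpler" framing.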
| 02:02 | <Hixie> | hsivonen: please confirm that since the last time i checked about your parsing e-mails, you have sent only one further message (about <select>) |
| 02:21 | <Hixie> | holy crap, according to this nearly half of all tables with headers="" have a cycle |
| 02:21 | <Hixie> | that seems unlikely |
| 02:22 | <Hixie> | in fact of 60,000 tables with headers="" that i just parsed, only 194 came out without some sort of error |
| 02:22 | <Hixie> | and of those, 177 didn't need headers="" at all because scope="" got the same effect |
| 02:23 | <Hixie> | leaving 17 tables out of 60,000 with headers="" (in just over 100,000,000 documents total) that used headers="" in a non-trivial yet correct way |
| 02:23 | Hixie | looks at those 17 tables |
| 02:24 | <Hixie> | one of them was the table on http://cgi.ebay.ie/Nokia-6210-unlocked-battery-charger-WARRANTY_W0QQitemZ200124682259QQihZ010QQcategoryZ3312QQcmdZViewItem |
| 02:24 | <Hixie> | and it only uses headers with the empty string as its value |
| 02:24 | <Hixie> | maybe i should exclude those, huh |
| 02:24 | <Hixie> | in fact 9 of these were variants on that ebay page |
| 02:25 | <othermaciej> | would that require assuming no header is a header for that cell? |
| 02:25 | <Hixie> | my headers="" algorithm used nothing but headers="" to assign headers to cells |
| 02:25 | <Hixie> | so <th> elements have no effect when headers="" is specified |
| 02:26 | <othermaciej> | what I'm wondering is, whether that is the specified behavior for headers="" |
| 02:26 | <Hixie> | in html4? |
| 02:27 | <othermaciej> | yeah |
| 02:28 | <Hixie> | ok i clearly need to look for tables with only blank headers="", since all but one of these uses of headers="" that differ from scope="" are blank headers="" only. |
| 02:28 | <othermaciej> | I guess HTML4 is not very clear on it |
| 02:28 | <Hixie> | (http://www.bls.gov/oco/cg/cgs041.htm being that page) |
| 02:30 | <Hixie> | and that page only uses headers="" to associate <th>s with parent <th>s |
| 02:30 | <Hixie> | it doesn't actually do anything to make the table accessible as far as i can tell |
| 02:32 | <othermaciej> | that's a pretty poor record |
| 02:32 | <Hixie> | i'm skeptical of the large number of loops |
| 02:32 | <Hixie> | that seems unlikely |
| 02:32 | <othermaciej> | .3% of usage being error-free seems pretty damn low, even by the already low standards of most HTML features |
| 02:33 | <othermaciej> | that does sound suspicious (the number of loops) |
| 02:33 | <Hixie> | i also scanned longdesc="" in the same survey. i had my script throw out obviously invalid uses of longdesc="", like pointing to a file that the parent <a href=""> points to. |
| 02:34 | <Hixie> | doing a spot check of the pages that came up as "good" uses, one was pointing to the same file, and another was pointing to a file that was the destination of a 301 redirect of a parent <a href=""> |
| 02:54 | <Hixie> | wow, longdesc is a disaster zone far worse than i had imagined |
| 02:57 | <Hixie> | many of these are just pointing to the root of the site! |
| 02:57 | Hixie | adds another heuristic to look for that |
| 02:57 | <Hixie> | lol, the longdesc="" on http://www.felicieditore.it/ points to http://www.felicieditore.com/, which doesn't exist |
| 03:00 | <Hixie> | http://7mobile.de/shop/select?id=101787&v=010000 is a longdesc disaster in so many ways |
| 03:06 | <Lachy> | Hixie: is it looking so bad for headers and longdesc that you're going to consider leaving them out? |
| 03:08 | <Hixie> | i'm going to _consider_ leaving them out just like i'm going to consider leaving them in |
| 03:09 | <othermaciej> | right now it's looking kind of bad for headers even on just a "degrade gracefully in current versions of the #2 screen reader" basis |
| 03:09 | <Lachy> | ok. Maybe you could put them in, and include some algorithm to determine when it should be ignored due to it containing an illogical value |
| 03:09 | <othermaciej> | which I think was the best argument in its favor |
| 03:10 | <othermaciej> | if Hixie's data about how many uses are invalid holds up, anyway |
| 03:10 | <Hixie> | yeah i'm getting a sample of those with cycles to check that |
| 03:15 | <Hixie> | i think it's fair to say that no valid longdesc will ever point to the root of a domain, right? |
| 03:17 | <Hixie> | oh crap, missed dinner. bbl. |
| 04:03 | <Hixie> | ok there's definitely something wrong with the cycle detection |
| 04:14 | <othermaciej> | I think I found a mistake in CSS 2.1 (at least in the November 2006 WD) |
| 04:15 | <othermaciej> | is there any way to see a newer editor's draft so I can check if it is fixed before I report it? |
| 04:15 | <Hixie> | http://www.w3.org/Style/Group/css2-src/cover.html |
| 04:15 | Hixie | fixes the bug |
| 04:16 | <Hixie> | i was indexing using the wrong variable. duh. |
| 04:16 | <othermaciej> | can you check for me if this is really a mistake before I make an ass of myself |
| 04:16 | <othermaciej> | http://www.w3.org/Style/Group/css2-src/visufx.html says, about overflow, "It affects the clipping of all of the element's content except any descendant elements (and their respective content and descendants) whose containing block is the viewport or an ancestor of the element." |
| 04:16 | <othermaciej> | but obviously that is not supposed to apply to overflow on the viewport itself |
| 04:16 | <Hixie> | what's the error? |
| 04:17 | <othermaciej> | right? |
| 04:17 | <Hixie> | right, the viewport is not an element |
| 04:18 | <othermaciej> | ok, maybe just a lack of clarity, not an error |
| 04:18 | <othermaciej> | since if you interpret it that way, it doesn't say anything about how to clip for overflow on the viewport |
| 04:18 | <Hixie> | that sentence doesn't really say anything about anything |
| 04:20 | <othermaciej> | later examples seem to assume it is saying something |
| 04:20 | <Hixie> | yeah, css2.1 is only marginally better than html4 in terms of spec quality |
| 04:27 | <othermaciej> | ok maybe I won't bother with this, even though it was confusing to me, the actual behavior seems to be interoperable |
| 05:02 | <Hixie> | Lachy: yt? |
| 06:26 | <Hixie> | every page i've checked so far that has non-redundant headers="" actually uses them incorrectly. |
| 06:27 | <Hixie> | although maybe we need a heuristic for the top-left cell |
| 06:45 | <Hixie> | ok i finally found a page with a real longdesc="" |
| 06:45 | <Hixie> | http://www.britanniarescue.com/about/strategy/ |
| 06:45 | <Hixie> | http://www.britanniarescue.com/online/longdesc/index.php#BRlogo |
| 06:46 | <Hixie> | the longdesc is inaccurate, and it would be more useful for the information in that file to be in alt="" text anyway |
| 06:59 | <Hixie> | longdesc="mailto:trustee⊙nc" |
| 06:59 | <Hixie> | wtf |
| 07:25 | <hsivonen> | Hixie: confirmed only one additional email |
| 07:28 | <Hixie> | thanks |
| 07:28 | <Hixie> | just making sure none of your mails fall through the cracks when i speed-read the html list... |
| 07:55 | <hsivonen> | Hixie: should I CC you next time? |
| 07:56 | <Hixie> | no, it's ok |
| 07:56 | <Hixie> | just making sure |
| 07:56 | <hsivonen> | ok |
| 07:57 | <hsivonen> | on the face of it, http://www.britanniarescue.com/about/strategy/ seems to have decorative images. why do they bother with longdesc? |
| 07:57 | <Hixie> | i just select all mail to html and read it, then select all mail to the next list and read it, etc |
| 07:57 | <Hixie> | i have no idea why they use it |
| 07:57 | <Hixie> | probably because It's The Law |
| 07:58 | <Hixie> | after looking at all this in more detail, i'm starting to suspect that the accessibility advocacy has maybe done more damage than help, sadly |
| 07:59 | <hsivonen> | yeah. in some twisted way it seems to me that by speccing accessibility features we might actually create lawyerbombs :-( |
| 08:20 | <Lachy> | Hey Hixie, I'm here now |
| 08:21 | <Hixie> | hey |
| 08:22 | <Hixie> | i found a workaround around whatever it was i was going to ask you |
| 08:22 | <Hixie> | which i've forgotten now |
| 08:22 | <Lachy> | ok, no worries |
| 08:23 | Lachy | is off to see the Transforms movie now |
| 08:23 | <Hixie> | aha, the next wave of data is in |
| 08:23 | <Lachy> | *Transformers |
| 08:23 | Hixie | examines |
| 08:25 | <Hixie> | lol |
| 08:25 | <Hixie> | one of the longdesc=""s points to a file called spacer.txt |
| 08:25 | <Hixie> | i have my doubts about the usefulness of THAT longdesc |
| 08:29 | <Dashiva> | How excellent, an accessible spacer gif |
| 08:29 | <Hixie> | there are 8 times more longdesc=""s that point to the same page as an ancestor <a href=""> than there are longdesc=""s that didn't get caught on any of my "likely to suck" heuristics |
| 08:30 | <Hixie> | and out of 8 million <table>s with a cell with a headers="" attribute, twenty thousand had a cycle in the headers="" |
| 08:30 | <Hixie> | jesus |
| 08:30 | <Hixie> | and over a million had IDs that pointed to elements that weren't cells! |
| 08:31 | <Hixie> | ten thousand had overlapping cells |
| 08:32 | <Hixie> | in about four million cases, the headers="" attributes were redundant given the algorithm in the spec for mapping <th>s to <td>s |
| 08:32 | <Hixie> | in about 80,000 cases the headers="" attribute _would_ have been redundant if all the headers used <th> elements instead of <td> |
| 08:32 | <Hixie> | leaving about 2 million cases that might be valid which i'll have to look at |
| 08:35 | <Hixie> | 2 for 2 on broken uses so far |
| 09:19 | <hsivonen> | http://tools.ietf.org/html/draft-walsh-tobin-hrri-00 |
| 09:20 | <annevk> | that's been up for a while now, not? |
| 09:21 | <annevk> | although I don't think they are actually fixing anything |
| 09:21 | <annevk> | they are just widening the range of allowed characters |
| 09:25 | <hsivonen> | annevk: may have been. I dunno. found out today |
| 09:25 | <zcorpan> | a superset of IRI? |
| 09:26 | <hsivonen> | zcorpan: so it seems |
| 09:26 | <hsivonen> | URL5 |
| 09:26 | <zcorpan> | yeah |
| 09:27 | <annevk> | that's what we need, yes |
| 09:27 | <annevk> | that's not what it is :( |
| 09:28 | <hsivonen> | URL, URI, IRI, HRRI, URL5 |
| 09:30 | <zcorpan> | were there not more names somewhere in between? |
| 09:30 | annevk | learns about ephemeral |
| 09:30 | <annevk> | there's XRI -> HRRI |
| 09:30 | <annevk> | iirc |
| 09:31 | <annevk> | IRIs are not done yet fwiw |
| 09:38 | <annevk> | dropped / not included / omitted / ...? |
| 09:38 | <annevk> | suggestions? |
| 09:40 | <annevk> | excluded? |
| 09:41 | <zcorpan> | 2007-07-01 17:35 Ben 'Cerbera' Millard "absent" might be even better? |
| 09:41 | <zcorpan> | 2007-07-01 17:35 Ben 'Cerbera' Millard "not included" can still imply "we decided not to include these" |
| 09:41 | <zcorpan> | 2007-07-01 17:35 Ben 'Cerbera' Millard "absent" just means "not present" |
| 09:42 | <annevk> | cool |
| 10:04 | <zcorpan> | people really think that new features will suffer less from interop problems than existing features |
| 10:05 | <annevk> | it's mostly an academic exercise it seems |
| 10:05 | <annevk> | although not a real interesting one at that |
| 10:42 | <Hixie> | "Is XHTML 5 the successor of XHTML 2? Of course not." seems to beg the question with tr/52/21/ |
| 10:42 | <Hixie> | didn't someone already ask him that? |
| 10:44 | <Hixie> | oh i see henri basically said that already |
| 10:44 | <annevk> | maybe we should have "HTML 5" (language) and HTML and XHTML (syntax) |
| 10:44 | <annevk> | the XHTML syntax for HTML 5 shorthand would be XHTML5 but that would be unofficial |
| 10:44 | <othermaciej> | s/beg the question/invite the question/ |
| 10:45 | othermaciej | hopes that here at least he can still be gently pedantic |
| 10:45 | zcorpan | hasn't seen the tr/// constructor before |
| 10:45 | <othermaciej> | it's sed syntax |
| 10:45 | <othermaciej> | (also perl I think) |
| 10:46 | <othermaciej> | same source as s/foo/bar/ |
| 10:50 | <zcorpan> | seems useful :) |
| 10:52 | zcorpan | also learns that other punctuation and parentheses can be used instead of slashes |
| 10:56 | <annevk> | the WHATWG sniffing algorithm doesn't seem to deal with .ico formats, bitmaps, etc. |
| 10:59 | <zcorpan> | http://del.icio.us/url/99931bd7993088a7dc60da0a031732e1 -- "(X)HTML4" |
| 10:59 | <Hixie> | annevk: seems easiest to just ignore the whole issue, frankly. it's not like the spec is called "xhtml5" |
| 10:59 | <Hixie> | annevk: does the spec allow for extra rows to sniff such types? |
| 11:00 | <krijnh> | zcorpan: vpieters? :| |
| 11:00 | <annevk> | Hixie, no it says "User agents must ignore any rows for image types that they do not support." |
| 11:00 | <annevk> | which seems to conflict with the warning earlier on |
| 11:00 | <annevk> | I might have mentioned that on the mailing list already |
| 11:00 | <zcorpan> | krijnh: and condor87 |
| 11:01 | <Hixie> | annevk: ah well we'll have to add rows then |
| 11:09 | annevk | ponders about <picture> |
| 11:10 | <annevk> | it seems such an obvious failure, how can they not see it? |
| 11:13 | <hsivonen> | annevk: indeed |
| 11:14 | <hsivonen> | annevk: Sander Tekelenburg's attempt at making it backwards compatible should show that the nice idea gets out of control quickly when you scratch the surface |
| 11:14 | <annevk> | neither proposal even works in IE7 |
| 11:15 | <hsivonen> | I try to focus on tree building instead of spending the whole day replying to the list |
| 11:16 | <annevk> | I think I'll work on some tests for getBoundingClientRect and getClientRects or something |
| 11:16 | <annevk> | lunch first! |
| 11:16 | <hsivonen> | I'm getting more and more convinced that grouping by insertion mode first and by element second makes sense |
| 11:16 | <annevk> | you're keeping insertion modes? |
| 11:17 | <hsivonen> | with fall through for IN_TABLE etc. to IN_BODY and from IN_BODY to IN_HEAD_NOSCRIPT to IN_HEAD |
| 11:17 | <hsivonen> | annevk: no. I have just phases |
| 11:17 | <annevk> | oh ok |
| 11:17 | <annevk> | i like your code for the tokenizer quite a bit |
| 11:18 | <annevk> | although the comments are quite verbose |
| 11:18 | <hsivonen> | annevk: it's the spec :-) |
| 11:18 | <annevk> | yeah :) |
| 11:18 | <hsivonen> | too bad that doing the same for tree building is too much work |
| 11:19 | <annevk> | we just need lots of testcases |
| 11:19 | <annevk> | if zcorpan gets a proper browser framework to work for html5lib tests I assume we'll get even more testcases there |
| 11:20 | <hsivonen> | I intend to print my tree builder and the spec and go over them with a highlighter pen to check that everything is there |
| 11:20 | <annevk> | especially since the testformat is quite easy and the output can be generated using tools (assuming html5lib is compliant) |
| 11:21 | <annevk> | not sure yet how to test the formpointer stuff |
| 11:21 | <annevk> | that may require some extension |
| 11:22 | <hsivonen> | annevk: I have been thinking of a sanitizer tree that puts an UUID ID on <form> and form='' on out-of-subtree associated inputs |
| 11:29 | <Hixie> | so has anyone actually defined the problem that <picture> is intended to solve? |
| 11:31 | <hsivonen> | Hixie: implicitly, the problem is that <img> doesn't allow structured fallback--only a plain string |
| 11:31 | <Hixie> | aah |
| 11:32 | <Hixie> | does he elaborate on why <object> and longdesc="" don't handle this well enough? |
| 11:32 | <Hixie> | http://www.grupodignidade.org.br/projetos.php - <img src="img/logo.gif" alt="logo" width="160" height="80" longdesc="http://www.grupodignidade.org.br/img/logo.gif" /> |
| 11:32 | <Hixie> | sigh |
| 11:32 | <hsivonen> | Hixie: for <object>, yes. for longdesc, I no longer remember |
| 11:32 | <Hixie> | k |
| 11:33 | <Hixie> | bed time |
| 11:33 | <Hixie> | nn |
| 11:33 | <hsivonen> | nn |
| 11:39 | <annevk> | the table and longdesc study is interesting |
| 11:59 | <zcorpan> | hmm, it's not possible to check what case elements are in the dom in html, is it? except perhaps trying getElementsByTagNameNS or something |
| 12:04 | <annevk> | don't think so |
| 12:04 | <annevk> | unless localName is somehow secured |
| 12:05 | <zcorpan> | given webkit's implementation experience with my suggestion about localName, even that seems to be a dead end |
| 12:07 | <zcorpan> | i'll just have to use toLowerCase() |
| 12:11 | <zcorpan> | http://simon.html5.org/temp/html5lib-tests/wrapper.html -- got something working at least. now i just need to figure out how to parse and test the real files. or perhaps i'll just use another wrapper with some php. that may be simpler, dunno |
| 12:14 | <zcorpan> | the function fails in ie if there's a short bogus comment like <!foo> |
| 12:31 | <zcorpan> | </> results in a "/" element in ie |
| 12:38 | <zcorpan> | same as </foo> really |
| 12:39 | <zcorpan> | stray </x:y> gets dropped |
| 12:52 | <annevk> | dropping </> works just as well |
| 12:57 | <zcorpan> | oh sure. i was surprised that ie didn't drop it |
| 13:46 | <annevk> | lol |
| 13:46 | <annevk> | tr > tbody > td |
| 13:46 | <annevk> | tbody is not implied! |
| 13:59 | <Philip`> | Shouldn't that be "tbody > tr > td"? |
| 13:59 | <annevk> | yeah |
| 14:01 | <Philip`> | Ah |
| 14:43 | <zcorpan> | making progress...: http://simon.html5.org/temp/html5lib-tests/wrapper.html |
| 14:44 | <zcorpan> | now i just need to make the text file into two arrays |
| 14:45 | annevk | wonders in what kind of fantasyland some people live |
| 14:45 | <annevk> | "I was thinking exactly the opposite, and wondering whether Microsoft might be persuaded to migrate their horrific 'Active-X' strings from the opening <object> tag to an nested <param>." |
| 14:46 | <Philip`> | zcorpan: "Security error: attempted to read protected variable" - why doesn't Opera like that? |
| 14:47 | <zcorpan> | Philip`: dunno, works in Kestrel |
| 14:48 | <Philip`> | Oh, okay, maybe it's only a problem with 9.2 |
| 14:49 | <annevk> | evil data: URIs |
| 14:49 | <hsivonen> | annevk: in a world where the value of π is a legislative decision |
| 14:55 | <zcorpan> | any suggestions on how to read the text file with js? |
| 14:56 | <hsivonen> | zcorpan: XHR? |
| 14:56 | <zcorpan> | hsivonen: yeah. although in firefox i got a "syntax error" when trying to read .responseText |
| 14:59 | <zcorpan> | but let's assume that doesn't happen in firefox and i can read the file... how do i then parse it into two arrays? |
| 14:59 | <zcorpan> | my previous attempt with split() was too naïve and didn't really work |
| 15:00 | <Philip`> | Regular expressions? |
| 15:00 | <Philip`> | Whatever the problem, they are always the solution |
| 15:00 | <annevk> | :p |
| 15:00 | <hsivonen> | "now you have two problems" :-) |
| 15:00 | <annevk> | why doesn't split("\n\n") work? |
| 15:02 | <zcorpan> | does that work with multiple lines? |
| 15:02 | <zcorpan> | also, what if a test has e.g. \n\n as data |
| 15:02 | <zcorpan> | or doesn't the syntax allow for that? |
| 15:02 | <annevk> | oh right, yes |
| 15:02 | <zcorpan> | i think it does, so long as no test has \n\n as data |
| 15:03 | <annevk> | no \n\n can occur |
| 15:03 | <zcorpan> | ok |
| 15:03 | <annevk> | just split on \n\n#data or something and remove #data from the first line too |
| 15:03 | <zcorpan> | splitting removes automatically |
| 15:06 | <Philip`> | http://wiki.whatwg.org/wiki/Parser_tests#Tree_Construction_Tests doesn't seem to say it has to have blank lines between tests - the only delimiter is "\n#data\n" |
| 15:06 | <annevk> | sure, but the first test doesn't start with \n\n |
| 15:06 | <annevk> | Philip`, except for the first test... |
| 15:06 | <annevk> | also, two newlines is sort of accepted |
| 15:06 | <Philip`> | /^#data$/ |
| 15:06 | <Philip`> | /^#data$/ |
| 15:08 | <Philip`> | Uh |
| 15:08 | <Philip`> | /^#data$/m |
| 15:10 | <Philip`> | (or something like /\n*^#data\n/m if you want to strip newlines, assuming the last test doesn't end with a newline) |
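A rough sketch of turning the test file text into the "two arrays" zcorpan is after, splitting on lines that consist solely of `#data` as suggested above instead of on blank lines. The function name `parseTests` is hypothetical, and the sketch assumes each test has the shape `#data`, `#errors`, `#document`; the `#document` marker is an assumption here, since only `#data` and `#errors` are named in this log. Trailing blank lines after each expected tree are dropped.

```javascript
// Hypothetical helper: returns parallel arrays of test inputs and expected trees.
function parseTests(text) {
  const inputs = [];
  const expected = [];
  // With the "m" flag, ^ matches at the start of every line, so the first
  // test (which has no newline before it) is split off like the rest.
  for (const chunk of text.split(/^#data\n/m)) {
    if (chunk === "") continue; // nothing before the first "#data" line
    // Prefix a newline so an empty input (marker on the first line) still matches.
    const body = "\n" + chunk;
    const errorsAt = body.indexOf("\n#errors\n");
    if (errorsAt === -1) continue; // malformed test, skipped in this sketch
    const marker = "\n#document\n";
    const documentAt = body.indexOf(marker, errorsAt);
    if (documentAt === -1) continue;
    inputs.push(body.slice(1, errorsAt));
    expected.push(body.slice(documentAt + marker.length).replace(/\n+$/, ""));
  }
  return { inputs, expected };
}
```

Because the split is anchored on the marker line and not on blank lines, test data that itself contains blank lines survives intact.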
| 15:11 | Philip` | wonders if anyone has written test cases for test case parsers |
| 15:12 | <Philip`> | though I'm not entirely sure how you'd parse the tests for the test parser |
| 15:12 | <zcorpan> | we need a parsing spec for the test case format |
| 15:12 | <zcorpan> | -_- |
| 15:40 | <annevk> | I tweaked http://wiki.whatwg.org/wiki/Parser_tests#Tree_Construction_Tests a bit to make it more clear what the actual format is |
| 15:41 | <Philip`> | The link at the bottom to the tests should probably be updated |
| 15:42 | <Philip`> | 'a line that says "#errors:"' - probably shouldn't have the colon |
| 15:43 | <annevk> | at some point the format used by http://html5lib.googlecode.com/svn/trunk/testdata/tree-construction/tests4.dat should be added too and the description could use some more whitespace... |
| 15:56 | <zcorpan> | yay |
| 15:57 | <zcorpan> | works in Kestrel now |
| 15:58 | <annevk> | zcorpan, sweet |
| 15:58 | <zcorpan> | firefox boils at...: Error: unexpected end of XML source |
| 15:58 | <zcorpan> | Source File: data:text/html,<script><div></script></div><title><p></title><p><p> |
| 15:58 | <zcorpan> | Line: 1, Column: 4 |
| 15:58 | <zcorpan> | Source Code: |
| 15:58 | <zcorpan> | <div> |
| 15:58 | <annevk> | ah |
| 15:59 | <zcorpan> | is that e4x or something? |
| 15:59 | <Philip`> | It works in precisely none of the five browsers I have access to :-( |
| 15:59 | <annevk> | put encodeURIComponent around it |
| 15:59 | <annevk> | maybe that will make it work better (it's also theoretically more correct) |
| 15:59 | <zcorpan> | don't think that's the problem |
| 15:59 | <zcorpan> | it's <script><div></script> in the actual test |
| 16:00 | <annevk> | maybe catch all error events and silence them? |
| 16:01 | <annevk> | iframe.onerror = function ... |
| 16:01 | <Philip`> | That would be parsed as E4X, I believe - it's only in the cases of <!--...--> and <![CDATA[...]]> where you have to use type="text/javascript;e4x=1" |
| 16:02 | <annevk> | iframe.onerror = null |
| 16:02 | <annevk> | or something |
| 16:02 | <Philip`> | (http://developer.mozilla.org/en/docs/E4X) |
| 16:02 | <zcorpan> | annevk: doesn't help |
| 16:02 | <zcorpan> | annevk: don't think JS errors bubble up to the parent document |
| 16:03 | <annevk> | zcorpan, iframe.contentWindow.onerror = null |
| 16:03 | <zcorpan> | annevk: nope |
| 16:04 | <annevk> | does it actually work if you remove that test? |
| 16:05 | <zcorpan> | hmm. no. |
| 16:05 | <annevk> | btw, it would be nice if you showed the input data in the result tree as well |
| 16:06 | <annevk> | makes it easier to analyze potential errors |
| 16:06 | <Philip`> | Could change the tests to do <script type="unsupported"> so browsers won't try running them |
| 16:07 | <annevk> | that may work |
| 16:08 | <zcorpan> | or use //<div> instead of <div> |
| 16:08 | <zcorpan> | annevk: done |
| 16:08 | <annevk> | done what? |
| 16:09 | <zcorpan> | showed the input data |
| 16:09 | <annevk> | ah |
| 16:09 | <annevk> | does it matter though that browsers run them? |
| 16:10 | <zcorpan> | no, don't think so |
| 16:10 | <annevk> | zcorpan, btw iframe.contentWindow.onerror = function(foo,bar,baz) { return false } |
| 16:10 | <annevk> | might prevent the error from appearing |
| 16:10 | <zcorpan> | it's some other reason why it doesn't work in firefox |
| 16:10 | <zcorpan> | ok |
| 16:12 | <zcorpan> | xhr only works on the same domain, right |
| 16:12 | <zcorpan> | might need a server side script to include external tests |
| 16:12 | <annevk> | yeah, same-origin |
| 16:15 | <Philip`> | If the external tests were in a format that was valid JS, you could include them with <script src> |
| 16:16 | <zcorpan> | well, they're not. :) |
| 16:16 | <Philip`> | Or if you could change the external tests to be in a format that was valid JS :-) |
| 16:17 | <zcorpan> | seems simpler to write a server-side wrapper for this |
| 16:17 | <Philip`> | but I guess the point of it being external is that it's external and out of your control |
| 16:17 | <annevk> | zcorpan, how about a document.write() version? |
| 16:18 | <zcorpan> | annevk: ? |
| 16:18 | <annevk> | zcorpan, instead of iframe.src = do iframe.contentDocument.open(); iframe.contentDocument.write(testdata); etc. |
| 16:18 | <annevk> | that's how the live-dom-viewer works |
| 16:19 | <zcorpan> | ah |
| 16:19 | <zcorpan> | ok |
| 16:21 | <zcorpan> | it doesn't fire a load event then. but i guess i could make it work. what's the benefit? |
| 16:21 | <annevk> | works in IE |
| 16:21 | <annevk> | just copy some of the live-dom-viewer logic |
| 16:21 | <annevk> | should be doable |
| 16:24 | <zcorpan> | works in firefox with that change |
| 16:25 | <zcorpan> | and opera 9.2 |
| 16:27 | <zcorpan> | ie only wants to load the first test |
| 16:30 | <annevk> | that's an improvement |
| 16:32 | <zcorpan> | "childNodes is null or not an object" |
| 16:32 | <zcorpan> | for (var i = 0; i < node.childNodes.length; i += 1) { |
| 16:34 | <annevk> | hmm |
| 16:34 | <zcorpan> | ah |
| 16:34 | <zcorpan> | contentDocument -> contentWindow.document |
| 16:34 | <annevk> | whoa |
| 16:35 | <annevk> | that's supposed to be equivalent |
| 16:35 | <Philip`> | It's kind of irritating when you're trying to write tests to help interoperability between browsers, but then you can't even write a script to run the tests without hitting non-interoperability issues between every browser... |
| 16:35 | <zcorpan> | now it works in ie |
| 16:35 | <zcorpan> | Philip`: yeah |
| 16:35 | <zcorpan> | but it outputs everything on one line |
| 16:36 | <zcorpan> | \n -> \r\n ? |
| 16:36 | <annevk> | yeah |
| 16:37 | <zcorpan> | YAY! |
| 16:37 | <zcorpan> | :D |
| 16:37 | <zcorpan> | doesn't work in safari though |
| 16:38 | <annevk> | hmm |
| 16:38 | <annevk> | blame mjs :p |
| 16:38 | <zcorpan> | othermaciej: yt? :) |
| 16:39 | <annevk> | IE fails everything because of its fixed <title> |
| 16:41 | <annevk> | zcorpan, the test output numbers don't match the test input numbers |
| 16:41 | <annevk> | zcorpan, it seems that way |
| 16:41 | <zcorpan> | the output number is 1 greater, right? |
| 16:42 | <annevk> | hmm, IE and Opera seem to be one off |
| 16:42 | <zcorpan> | yeah |
| 16:42 | <zcorpan> | it's correct |
| 16:42 | <zcorpan> | the first test is empty |
| 16:42 | <zcorpan> | .split(/\n*#data\n/m) |
| 16:42 | <annevk> | so why are they one off? |
| 16:43 | <annevk> | IE saying it's 24 and Opera claiming it's 25... |
| 16:43 | <zcorpan> | "foobar".split("foo") // ["", "bar"] |
| 16:44 | <zcorpan> | i guess i could remove the first entry from the array but it seemed simpler to ignore it |
| 16:45 | <zcorpan> | they might do different things with split() |
| 16:47 | <zcorpan> | yep |
| 16:47 | <zcorpan> | javascript:(function(){var arr = "#data\nfoo".split(/\n*#data\n/m); alert(arr.length); })() |
| 16:49 | <Philip`> | (Is it intentional that that will match strings like "foo#data\n"?) |
| 16:49 | <zcorpan> | not really |
| 16:50 | <Philip`> | (That was what the ^ in /\n*^#data\n/m was for :-) ) |
| 16:51 | <zcorpan> | (fixed) |
| 16:53 | <zcorpan> | ok, fixed the number of tests issue |
| 16:56 | <zcorpan> | ie passes test 101 |
| 16:57 | <annevk> | <html><head><title></title><body></body></html> ... |
| 16:58 | <zcorpan> | amazing that i got the format right on the first try. i didn't even look at the documentation |
| 16:58 | <annevk> | hixie designed it |
| 16:59 | <zcorpan> | Hixie: if you could get people to use html right on the first try... ;) |
| 16:59 | <annevk> | I'm quite disappointed by the large number of fails |
| 16:59 | <annevk> | Hopefully that will improve in due course by either updating the tests or the spec |
| 17:00 | <zcorpan> | annevk: in which browser? |
| 17:00 | <annevk> | all? |
| 17:00 | <Philip`> | Could you make a table of the results for all browsers, to see which tests don't match any browser's reality? |
| 17:01 | <zcorpan> | i guess |
| 17:01 | <zcorpan> | but there are more tests |
| 17:01 | <zcorpan> | i want to figure out how to run those |
| 17:01 | <zcorpan> | first food |
| 17:01 | <annevk> | another for loop around the xhr |
| 17:01 | <annevk> | or just merge everything on the server |
| 17:01 | <zcorpan> | yeah |
| 17:02 | <annevk> | it would be good if you at some point committed this back to html5lib |
| 17:03 | <annevk> | then we can make the acid-parser test |
| 17:03 | <zcorpan> | perhaps i don't need to do server side magic |
| 17:03 | <annevk> | other things that might be nice: 1) some colors on the result page to make it easier to scan 2) collapsible items on the result page |
| 17:04 | <annevk> | especially the second is useful given the large number of tests that fail :) |
| 17:04 | zcorpan | makes notes |
| 17:05 | <annevk> | zcorpan, did you "fix" the difference in counting with IE? |
| 17:07 | <annevk> | I'm thinking that it might be useful to include a bunch of <title></title> in a lot of testcases to make the IE results more usable |
| 17:08 | <Philip`> | Could you post-process the results to ignore ones where the only difference is the "| <title>" line? |
| 17:09 | <Philip`> | (or mark as uninteresting, rather than entirely ignore them) |
| 17:10 | <annevk> | that'd be another option |
| 17:10 | <annevk> | prolly better |
| 17:32 | <rubys> | any html5lib developers awake here? :-) |
| 17:36 | annevk | is |
| 17:37 | <annevk> | zcorpan ported html5lib tests to browsers |
| 17:37 | <annevk> | see http://simon.html5.org/temp/html5lib-tests/wrapper.html for tree-construction/tests1 |
| 17:38 | <rubys> | Anne, can you do me a favor and svn update and then run: |
| 17:38 | <rubys> | python parse.py --tree "<p><b><i><u></p><p>X" |
| 17:41 | <annevk> | get two <p> siblings the second containing the same as the first plus "X" as deepest child |
| 17:43 | <rubys> | nevermind, I found my problem (the actual test2 #45 actually has a new line in the middle) |
| 17:43 | <rubys> | sorry to bother you |
| 17:43 | <annevk> | no worries |
| 18:01 | <annevk> | hsivonen, how would this UUID stuff work? |
| 18:02 | <annevk> | hsivonen, what I'm interested in is annotating the test results for tree construction with that information |
| 18:28 | <met_> | http://ydnar.vox.com/library/post/webkit-team-adds-audio-video-support.html |
| 18:35 | <zcorpan> | annevk: i did |
| 18:40 | <othermaciej> | zcorpan: what's the problem? |
| 19:51 | <zcorpan> | othermaciej: http://simon.html5.org/temp/html5lib-tests/wrapper.html doesn't work in safari (for windows). don't know why |
| 19:52 | <othermaciej> | I was hoping it would be obvious but there's a whole lot of script there |
| 19:53 | <zcorpan> | would the web inspector help me debug? how do i activate it on windows? |
| 19:53 | <othermaciej> | zcorpan: it's got a "parse error" and a "maximum call stack size exceeded" |
| 19:53 | <othermaciej> | the JavaScript error console (in the debug menu) would tell you that |
| 19:53 | <zcorpan> | don't see a debug menu |
| 19:54 | <othermaciej> | yeah, you have to turn it on with a command-line switch |
| 19:54 | <othermaciej> | google for "safari windows debug menu" |
| 19:54 | <othermaciej> | I don't remember the details at the moment |
| 19:54 | <billmason> | http://rakaz.nl/item/enabling_the_debug_menu_on_safari_for_windows |
| 19:54 | <zcorpan> | ok, will do |
| 19:54 | <othermaciej> | is dom2string going to recurse to a depth of more than 99? |
| 19:54 | <zcorpan> | billmason: cheers |
| 19:54 | <othermaciej> | if so, that's probably the problem |
| 19:55 | <othermaciej> | we should probably relax that stack limit |
| 19:55 | <zcorpan> | it might |
| 19:57 | <zcorpan> | but i don't think that's the problem, it didn't work with one test with the input "Test" either |
| 20:03 | <zcorpan> | is "run" a reserved word? |
| 20:05 | <hasather> | zcorpan: no |
| 20:05 | <zcorpan> | what is the SyntaxError: Parse Error on line 1 in http://simon.html5.org/temp/html5lib-tests/wrapper.html ? |
| 20:16 | <zcorpan_> | works when i have only 1 test in the file |
| 20:16 | <zcorpan_> | 2 tests as well |
| 20:17 | <hasather> | seems to be a problem with the test that looks like this: "<script><div></script></div><title><p></title><p><p>" |
| 20:20 | <hasather> | zcorpan: that seems to be the only test that has unallowed content in a script element |
| 20:22 | <jgraham> | zcorpan_: TestData in http://html5lib.googlecode.com/svn/trunk/python/tests/support.py contains the testcase parser that html5lib uses (you have to pass it a list of the section headings e.g. ("data", "errors", "document")) |
| 20:22 | <jgraham> | (that was a FYI if you have any more issues with the test format) |
| 20:28 | <zcorpan_> | hasather: ah. yes of course |
| 20:29 | <zcorpan_> | jgraham: thanks |
| 20:31 | <zcorpan_> | othermaciej: seems like the problem is the number of recursions indeed. not sure if i can/will work around that |
| 20:34 | <othermaciej> | zcorpan_: I'm sure your function could easily be rewritten not to be recursive |
| 20:34 | <zcorpan_> | othermaciej: can you do it for me? :) |
| 20:36 | <othermaciej> | zcorpan_: don't have time to actually test, but I can tell you roughly how to do it |
| 20:37 | <othermaciej> | you're effectively doing a preorder tree traversal |
| 20:37 | <othermaciej> | you can do that with a stack, or since you have parent pointers just with a simple loop |
| 20:38 | <othermaciej> | when entering a node, you do the entry processing (print node itself, increment indent) |
| 20:39 | <othermaciej> | then you check if it has children - if so, enter the first child |
| 20:39 | <zcorpan_> | (the live dom viewer has the same problem btw) |
| 20:39 | <othermaciej> | if no children, check for a next sibling - if present, do exit processing for current node and enter the next sibling |
| 20:40 | <othermaciej> | if no next sibling, do exit processing for this node, then continue from the parent as if it had no children (i.e. exit to the parent's next sibling or parent's parent and so forth) |
| 20:40 | <zcorpan_> | ok. thanks |
| 20:41 | <othermaciej> | we use this style of tree traversal internal to webcore all the time |
| 20:41 | <othermaciej> | in fact, we have an internal traverseNextNode function that does it |
| 20:41 | <othermaciej> | (although that doesn't visit a node again when exiting, which I think you want) |
| 20:42 | <zcorpan_> | yeah, i want to catch misnested nodes in ie |
| 20:43 | <zcorpan_> | or perhaps that's just a check before you process the children |
| 22:06 | <zcorpan_> | hmm. the question is how to handle misnested nodes. |
| 22:17 | <Philip`> | zcorpan_: Output "FAIL" and then stop? |
| 22:36 | othermaciej | facepalms at continuing mail from Rob Burns |
| 22:38 | <zcorpan_> | Philip`: yeah... but the recursive algorithm could output the entire tree anyway, which is nicer for debugging |
| 22:38 | <Philip`> | I don't quite see how trying to publish one document after four months counts as "rushing" |
| 22:39 | <Hixie> | <td id="m1" axis="mainMenu" headers="m1" valign="top"> |
| 22:39 | <Hixie> | sigh |
| 22:39 | <zcorpan_> | Hixie: hah |
| 22:40 | <othermaciej> | now that's some compact information |
| 22:40 | <othermaciej> | Hixie: is that the sort of thing causing all the cycles? |
| 22:44 | <Hixie> | it's at least one cause |
| 22:44 | <Hixie> | i'm going to rerun the survey with a special hack to count those separately |
| 22:47 | <Hixie> | i really have to stop e-mailing public-html |
| 23:04 | <zcorpan_> | annevk: are there tests on things like </p>, <html></p>, <head></p>, etc, in the html5lib tests? |
| 23:05 | <zcorpan_> | public-html starts to get pretty high traffic again |
| 23:16 | <Hixie> | typical longdesc: http://130.83.47.128/masterfiles/descriptions/logo.txt |
| 23:16 | <webben> | typical of what? |
| 23:17 | <Hixie> | typical of the longdescs that are actually not completely bogus |
| 23:17 | <Hixie> | (that's from http://130.83.47.128/vv/ss/comments/13.205.en.tud) |
| 23:17 | <Hixie> | (the first one on my list of "interesting" uses) |
| 23:18 | <webben> | not a terrible longdesc I suppose |
| 23:18 | <webben> | distinguishing between alternate text and explaining what the image is |
| 23:18 | <Hixie> | <a href="http://www.google.co.jp/"> |
| 23:18 | <Hixie> | <img src="http://blog2.fc2.com/2/20century/file/Logo_20s.gif" alt="Google" height="75" width="143" longdesc="http://www.google.co.jp/logos.html" /></a> |
| 23:18 | <webben> | shame they didn't explain what the logo actually depicts |
| 23:19 | Hixie | bangs head against table |
| 23:19 | <jgraham> | zcorpan_: I can't see any tests for those cases (though I thought anne had checked some in...). If you want to add some I can add you to the html5lib members list |
| 23:20 | <webben> | Hixie: maybe the text is helpful for that one |
| 23:20 | webben | can't read Japanese |
| 23:20 | <webben> | oh wait, Google can read Japanese |
| 23:20 | <Philip`> | But that logo.txt longdesc is in the wrong language for that page (which I guess could be because the site's developers had no way to actually test longdesc so it fell out of sync with the page contents)... |
| 23:20 | <Hixie> | from that en.tud page, lower down: |
| 23:20 | <Hixie> | <img src="/masterfiles/images/blue10x1.gif" alt="[Abstandhalter]" title="[Abstandhalter]" longdesc="/masterfiles/descriptions/abstandhalter.txt"> |
| 23:20 | <Hixie> | guess what the "/masterfiles/descriptions/abstandhalter.txt" file contains |
| 23:20 | <webben> | Philip`: good point |
| 23:23 | <Hixie> | i think i've yet to see an actual useful, value use of longdesc="" in this study |
| 23:24 | <Hixie> | bbl |
| 23:24 | <webben> | Hixie: you should include uses of D-links |
| 23:24 | <webben> | since for a long time D-link was used as a longdesc alternative based on poor support for longdesc |
| 23:26 | <webben> | see also: http://www.w3.org/TR/WCAG10-HTML-TECHS/#long-descriptions |
| 23:26 | <webben> | it would be interesting to know how many links in the wild have a value of D or [D] or similar |
| 23:26 | <webben> | s/value/text content/ |
| 23:28 | Philip` | wants to rewrite his own rubbish survey tool to be slightly less rubbish, so he can get vaguely interesting numbers about common features |
| 23:29 | <webben> | how many links ... and what they point to, of course |
| 23:29 | jgraham | wants a google-scale cluster to run a survey on |
| 23:30 | <jgraham> | and a pony, of course |
| 23:31 | <jgraham> | But seriously, Philip`, it would be nice if your survey tool was more widely available. It would be even better if the parser was fast. I wonder if any of the HTML5-parser-in-C projects are going to produce something soon? |
| 23:32 | <Philip`> | At least my initial version taught me that SQLite is completely rubbish when you have concurrency - it kept throwing exceptions because the whole database was locked |
| 23:32 | <Philip`> | so I need to rewrite it with MySQL or something |
| 23:34 | <Philip`> | and I think it should do some simple crawling, rather than only looking at a fixed list of URLs, so it can find more stuff to look at |
| 23:35 | <Philip`> | (and a faster parser would definitely be useful :-) ) |
| 23:37 | <Philip`> | (A Java one would probably be as good as a C one) |
| 23:39 | <bewest> | sounds like a bunch of people are interested in some kind of survey tool available to the community |
| 23:40 | <webben> | Here's a good example of longdesc-as-long-alternative: http://www.fhwa.dot.gov/hfl/framework/04.cfm referring to http://www.fhwa.dot.gov/hfl/framework/longdesc.cfm#fig1 |
| 23:40 | <bewest> | purpose would be 2-fold, correct? 1.) survey usage of authoring techniques on the web. 2.) test parsers? |
| 23:41 | <Philip`> | 3.) Confirm whether Hixie's stats are reasonable, or if he's just making up all the numbers :-) |
| 23:42 | <bewest> | I've thought about doing this with ec2 and Alexa's web services |
| 23:42 | <bewest> | eg greptheweb, and MSR |
| 23:42 | <bewest> | alexa has crawled documents in s3 |
| 23:43 | <bewest> | but that costs money |
| 23:44 | <zcorpan_> | jgraham: sure. i might check in this browser port too |
| 23:45 | <zcorpan_> | othermaciej: rewrote the function to not be recursive but still get the same error in safari |
| 23:45 | <bewest> | Philip`: so you already have some kind of survey tool? how does it work? |
| 23:46 | <Philip`> | bewest: Ah, I wasn't aware of those things, though I tend to never consider anything that requires money :-) |
| 23:47 | <bewest> | yeah... |
| 23:47 | <bewest> | usually I don't either |
| 23:47 | <bewest> | except that I work at the company that makes those services |
| 23:47 | <Philip`> | It was just something simple for things like http://canvex.lazyilluminati.com/misc/copyright.html and http://canvex.lazyilluminati.com/misc/summary.html |
| 23:48 | <Philip`> | (and a few other things which I can't remember where I put) |
| 23:48 | <Philip`> | where I give it a list of a few thousand URLs (from Yahoo search results for arbitrary terms), and it just downloads them then parses them (with html5lib) and looks for certain stuff |
| 23:49 | <Philip`> | (and sort of does those things in parallel, if you run lots of copies of the program, except most of the processes keep dying because SQLite gets unhappy) |
| 23:50 | <Philip`> | (and then some pages cause quadratic behaviour in html5lib and you have to manually delete them from the database) |
| 23:50 | <Philip`> | (so it's all just horribly hacked together :-p ) |
| 23:51 | <bewest> | heh |
| 23:52 | <othermaciej> | zcorpan_: that's odd |
| 23:52 | <othermaciej> | zcorpan_: pointer? |
| 23:53 | <zcorpan_> | othermaciej: http://simon.html5.org/temp/html5lib-tests/wrapper.html |
| 23:53 | <Hixie> | webben: studying text contents is much harder for various reasons |
| 23:54 | <webben> | of course it's harder |
| 23:54 | <webben> | but given we're talking about what's basically a language for marking up text, such study is pretty critical |
| 23:55 | <Hixie> | be my guest :-) |
| 23:57 | <othermaciej> | zcorpan_: very confusing |
| 23:57 | <othermaciej> | zcorpan_: I'll try debugging it in a while - need to get coffee first |
| 23:57 | <zcorpan_> | othermaciej: ok |
| 23:58 | <zcorpan_> | man, i've really spent all day on this thing |
| 23:59 | <Hixie> | how does it feel to be paid to do this nonsense? :-) |
| 23:59 | <jgraham> | zcorpan_: You should now be able to commit to html5lib svn. If you're committing tests that html5lib doesn't pass, it's really good to email html5lib-discuss⊙gc so people know there hasn't been a regression |
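
The non-recursive preorder traversal that othermaciej outlines between 20:37 and 20:40 (enter a node, descend to its first child, otherwise move to the next sibling, otherwise climb back through the parents) could look roughly like the sketch below. It is not the code from wrapper.html: `makeNode` and `treeToString` are hypothetical names, the node objects are plain stand-ins that only mimic the DOM's `parentNode`, `firstChild` and `nextSibling` links, and the `| ` plus two-spaces-per-level output is an assumption about the tree format the tests expect.

```javascript
// Hypothetical stand-in for a DOM node: it only has the three links the
// traversal needs (parentNode, firstChild, nextSibling).
function makeNode(name, children = []) {
  const node = { name, parentNode: null, firstChild: null, nextSibling: null };
  children.forEach((child, i) => {
    child.parentNode = node;
    child.nextSibling = children[i + 1] || null;
  });
  node.firstChild = children[0] || null;
  return node;
}

// Iterative preorder traversal. Nothing recurses, so the depth of the tree
// is not limited by the engine's call stack.
function treeToString(root) {
  const lines = [];
  let indent = 0;
  let node = root;
  while (node) {
    // Entry processing: print the node itself, then increment the indent.
    lines.push("| " + "  ".repeat(indent) + "<" + node.name + ">");
    indent += 1;
    if (node.firstChild) {
      node = node.firstChild;
      continue;
    }
    // No children: exit nodes until one has a next sibling, or until the
    // root itself has been exited.
    while (node) {
      indent -= 1; // exit processing for `node`
      if (node === root) {
        node = null; // traversal finished
      } else if (node.nextSibling) {
        node = node.nextSibling;
        break; // enter the sibling on the next outer iteration
      } else {
        node = node.parentNode; // continue as if the parent had no children
      }
    }
  }
  return lines.join("\n");
}

const example = makeNode("html", [makeNode("head"), makeNode("body", [makeNode("p")])]);
console.log(treeToString(example));
```

Because every node is visited once on entry and once on exit, a check for misnested nodes (the IE case zcorpan_ mentions) could be placed at either point without changing the loop structure.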