| 00:57 | <takkaria> | annevk: on your latest post, first para, s/totaled/totalled/ and s/amount/number/ |
| 00:59 | <annevk> | ty, fixed |
| 01:00 | <annevk> | night all |
| 01:00 | annevk | -> bed |
| 01:58 | <annevk> | http://lists.w3.org/Archives/Public/public-xhtml2/2008Apr/0020.html |
| 01:58 | annevk | can't sleep (for those who read the earlier message) |
| 02:12 | <BenMillard> | it was snowing here yesterday morning...very unusual for this part of the UK |
| 02:13 | <BenMillard> | especially in April! |
| 02:16 | <Philip`> | Unfortunately it had all melted by the time I woke up (around 1pm) :-( |
| 02:16 | <BenMillard> | I took some photos...I suck at doing that, though they will appear on my blog eventually |
| 02:21 | <tomg> | it was nice |
| 02:21 | <tomg> | was shocked at how white it was |
| 02:23 | <BenMillard> | indeed, it was proper powder snow |
| 02:23 | <tomg> | only light snow was forecast |
| 02:26 | Hixie | e-mails the xhtml2 group |
| 02:41 | <MikeSmith> | BenMillard - can you give a brief summary of the goals of your ongoing table work? |
| 02:41 | <MikeSmith> | is there a particular hypothesis you're working under? |
| 02:42 | <BenMillard> | it's mostly to understand how tables get authored in reality, so we can design HTML5 features to be more realistic and robust |
| 02:42 | <BenMillard> | particularly the table header association mechanism |
| 02:43 | <BenMillard> | where "we" means HTMLWG in general |
| 02:44 | <BenMillard> | I should probably summarise the goals in the document :P |
| 02:45 | <MikeSmith> | BenMillard - yeah, that would be useful, I think |
| 02:46 | <MikeSmith> | summary of what you've done so far, and what else you want to do going forward |
| 02:46 | <MikeSmith> | what specific product/deliverable you have in mind |
| 02:46 | <BenMillard> | the document starts with a "Numbers" section which summarises what's been done |
| 02:47 | <MikeSmith> | for example, we could end up producing/publishing a W3C Note from it |
| 02:47 | <MikeSmith> | BenMillard - what's the URL? |
| 02:47 | <BenMillard> | oh, sorry I thought you were looking at it! it's here: http://sitesurgeon.co.uk/tables/ |
| 02:48 | <BenMillard> | deliverables include helping develop the HTML5 header association algorithm with other HTMLWG participants (James Graham and Simon Pieters, mostly) |
| 02:49 | <MikeSmith> | BenMillard - hmm, seems like that wouldn't be a specific deliverable but instead something the deliverable/document could be used for |
| 02:50 | <MikeSmith> | have you gotten much feedback from WAI folks about it yet? |
| 02:50 | <BenMillard> | I don't recall talking to any WAI people about it yet |
| 02:50 | <Hixie> | http://www.w3.org/2003/entities/2007xml/unicode.xml has a last-modified date of 1970 |
| 02:50 | <Hixie> | go w3c! |
| 02:50 | <MikeSmith> | heh |
| 02:51 | <BenMillard> | MikeSmith, it was done in my spare time so I've not been tracking feedback around it |
| 02:51 | Hixie | removes the -N from his command line so that his script will always download the file instead of assuming that it has an accurate date |
| 02:52 | <Hixie> | oh actually |
| 02:52 | <Hixie> | that could be a minefield bug |
| 02:52 | <Hixie> | with xslt |
| 02:52 | <Hixie> | nevermind |
| 02:52 | <MikeSmith> | Hixie - you mean the headers? |
| 02:52 | <MikeSmith> | Last-Modified: Sun, 06 Apr 2008 10:46:09 |
| 02:52 | <Hixie> | yeah |
| 02:52 | <Hixie> | yeah, minefield bug with xslt |
| 02:53 | <Hixie> | lovely |
| 02:53 | <MikeSmith> | ah, OK |
| 02:53 | <MikeSmith> | go minefield! |
| 02:54 | <BenMillard> | MikeSmith, yes the document itself does not contain a product |
| 02:54 | <MikeSmith> | BenMillard - fwiw, I think it might be worthwhile to post the URL and short summary to one of the WAI lists |
| 02:54 | <MikeSmith> | if you've not done that already |
| 02:55 | <BenMillard> | it's not under active development at the moment; I can't afford the time |
| 02:56 | <BenMillard> | I'd like it to be used by anyone who finds it useful, though |
| 02:56 | MikeSmith | is getting lots of "Connection reset by peer http://intertwingly.net/blog/index.atom" when trying to get Sam's feed |
| 02:56 | <MikeSmith> | BenMillard - I see |
| 02:57 | <BenMillard> | I've been seeking sponsorship from various places to make the work more sustainable |
| 02:57 | <BenMillard> | I took the whole of last month off work to try and speed that up, but haven't quite got there |
| 03:02 | <MikeSmith> | BenMillard - I would think investing some time in doing a little awareness-raising about it might help with getting others to consider sponsoring further work |
| 03:02 | <MikeSmith> | e.g., by posting about it to WAI lists or elsewhere |
| 03:03 | <BenMillard> | MikeSmith, that's a good idea. can you suggest a specific list for me to post it to? I've only sent it to public-html before now. |
| 03:05 | <MikeSmith> | BenMillard - no, I can't suggest any specific list. I don't follow them. Karl Dubost might know |
| 03:06 | <MikeSmith> | might be worthwhile for you to get in touch with Michael Cooper and/or Shawn Henry |
| 03:06 | <MikeSmith> | both W3C staff for WAI |
| 03:07 | <BenMillard> | I've spoken to Shawn Henry in person in November 2007; I could ask her |
| 03:10 | <MikeSmith> | BenMillard - yeah, she would certainly be able to tell you |
| 03:11 | MikeSmith | drops off to head into office |
| 03:11 | <BenMillard> | MikeSmith, thanks for your advice about this. I'm a noob when it comes to W3C! |
| 03:11 | <BenMillard> | oh bugger, he left just before I sent that |
| 03:35 | <jwalden> | BenMillard: ^ |
| 03:38 | <MikeSmith> | BenMillard - I'm a noob too.. I've worked for the W3C only for a year now. most people here actually have a lot more experience with the W3C than I do |
| 03:40 | <MikeSmith> | I'm sort of like a foreign element that's been inserted, and not clear yet if it's going to make things worse or better |
| 03:42 | <jwalden> | <foreignElement id="MikeSmith"/> |
| 03:49 | <BenMillard> | MikeSmith, that's cool |
| 03:49 | MikeSmith | reads dbaron's latest |
| 03:50 | <MikeSmith> | http://dbaron.org/log/20080406-acid3 |
| 03:50 | <MikeSmith> | "Teaching to the Test" (on HTML5 and Acid3 and Firefox) |
| 04:17 | <BenMillard> | MikeSmith, I've updated the tables document: http://sitesurgeon.co.uk/tables/ |
| 04:18 | <BenMillard> | away for lunch now (at 4am local time!) |
| 05:22 | <BenMillard> | MikeSmith, I've updated the tables document http://sitesurgeon.co.uk/tables/ is this better? |
| 05:46 | <MikeSmith> | BenMillard - which parts did you change/add? |
| 05:46 | <MikeSmith> | The Goals and Deliverables part? |
| 05:46 | <BenMillard> | that's right |
| 05:47 | <BenMillard> | I also moved the Feedback section to the top to emphasise the openness of its development |
| 05:49 | <BenMillard> | MikeSmith, using the UserName, Message format is more compatible with IRC clients and the #whatwg log than using UserName - Message |
| 05:49 | <BenMillard> | e.g.: |
| 05:49 | <BenMillard> | BenMillard, which parts did you change/add? |
| 05:49 | <MikeSmith> | um, thanks for the tip |
| 05:50 | <Hixie> | personally i'm more of a foo: bar fan than a foo, bar fan |
| 05:51 | <BenMillard> | either of those are compatible with IRC clients and the log, afaict, but foo - bar is not |
| 05:52 | <MikeSmith> | BenMillard - I don't know what you mean by compatible with IRC clients and the log |
| 05:52 | <MikeSmith> | not that I really care to get into a discussion about it |
| 05:53 | <BenMillard> | the highlight lines directed at you when either of the conventional methods are followed |
| 05:53 | <BenMillard> | *they |
| 05:54 | <BenMillard> | so if you address me as BenMillard, message it gets highlighted and I'm more likely to see it |
| 05:54 | <BenMillard> | (or BenMillard: message) |
| 05:55 | <Hixie> | either works with a decent irc client like irssi :-) |
| 05:55 | <MikeSmith> | BenMillard - yeah, I'd say that's problem between you and your IRC client |
| 05:55 | jwalden | snickers |
| 05:56 | <jwalden> | don't most clients get set off when your nick appears anywhere in the line, surrounded by \b ? |
| 05:58 | <BenMillard> | the , and : forms are the only ones I've found interoperable...is there a standard for this? |
| 05:59 | <jwalden> | standard? IRC? hah! |
| 06:00 | <jwalden> | or so I've been led to believe |
| 06:02 | <MikeSmith> | I think the standard for IRC is "please leave your %s at the door" |
| 06:03 | <MikeSmith> | jwalden - fwiw, XChat highlights the nick of the user who uses your nick in a message |
| 06:03 | <MikeSmith> | not your own nick |
| 06:03 | <MikeSmith> | which sorta makes sense to me |
| 06:04 | <MikeSmith> | I know my own name |
| 06:04 | <MikeSmith> | (most of the time) |
| 06:04 | <jwalden> | I used the phrase "set off" because it probably differs across clients; Chatzilla highlights the individual message, I assume others do other thigns |
| 06:04 | <MikeSmith> | the useful piece of information for this case seems to be, who's calling? |
| 06:04 | <jwalden> | s/gns/ngs/ |
| 06:06 | <BenMillard> | the IRC logs for this channel are at http://krijnhoetmer.nl/irc-logs/ and it supports both the , and : forms. so if you use one of those forms, while in this channel, the highlighting feature will work in the logs |
| 06:06 | <BenMillard> | although I agree that any message which mentions a user's name should light up in that user's client |
| 07:12 | <hsivonen> | http://bitworking.org/news/317/Revisionist-XHistory |
| 07:12 | <othermaciej> | hi all |
| 07:13 | <hsivonen> | hi |
| 07:27 | <hsivonen> | surprisingly, one of the hotspots of Validator.nu is TreeSet doing a silly number of compares when the inserted items are already ordered or reverse-ordered. |
| 07:28 | <hsivonen> | too bad the JDK and Commons Collections don't seem to have head/tail-biased linked list-backed SortedSets. |
| 07:29 | <othermaciej> | that's not a good way to make a sorted data structure |
| 07:30 | <othermaciej> | (it's a good way to make an ordered associative data structure) |
| 07:30 | <hsivonen> | othermaciej: what's not good? TreeSet? |
| 07:30 | <othermaciej> | no, a linked-list backed set |
| 07:30 | <othermaciej> | a TreeSet (assuming it's a balanced tree) is a good way to make a balanced associative structure |
| 07:31 | <othermaciej> | but inserting N items into it should be N log N so it's surprising it would be a hot spot |
| 07:31 | <hsivonen> | othermaciej: why not if you know the insertion will always be either to head of the list or the next from head? |
| 07:31 | <othermaciej> | if you have items that are already ordered, you would want to use a ListHashSet |
| 07:31 | <othermaciej> | (I think that's what Java calls it) |
| 07:32 | <hsivonen> | othermaciej: the insertions are *almost* ordered |
| 07:32 | <hsivonen> | that is, the new insertion is most often to the head of the list, but sometimes a slot or two further |
| 07:33 | <hsivonen> | (LinkedHashSet is not what I need here) |
| 07:33 | <othermaciej> | that doesn't have an insertBefore? |
| 07:33 | <othermaciej> | (too bad, it should be easy to do) |
| 07:35 | <hsivonen> | it appears it does not |
| 07:39 | <hsivonen> | the show source feature ends up comparing locations 29 times the number of location objects |
| 07:39 | <hsivonen> | that's not good |
| 08:08 | <BenMillard> | annevk, you wrote "heh, RDF fanatics use " and <em> for quotes" and I see things like that quite frequently on the blogs of markup/accessibility enthusiasts/experts |
| 08:12 | <BenMillard> | indeed, it's hard to find anyone using <q>...present company excepted :) |
| 08:13 | <BenMillard> | getting the right punctuation seems more important to authors than using the right element |
| 08:17 | <jwalden> | problem being q's quotation behavior (CSS's, that is) is underspecified, as dbaron tells me |
| 08:18 | <BenMillard> | Hixie, I made editorial changes to http://sitesurgeon.co.uk/tables/ which include clarifying the markup used by each group in "How Authors Indicate Headers in Data Tables". They were a bit vague before. Let me know if this changes anything. |
| 08:18 | <Hixie> | k |
| 08:19 | <Hixie> | i probably won't look at table stuff for some time |
| 09:09 | zcorpan_ | did not know about Document.strictErrorChecking |
| 09:10 | zcorpan_ | will define in dom5core what happens when it is false |
| 09:24 | <BenMillard> | Philip`, my (badly taken) photos of UK snow are now blogged: http://projectcerbera.com/blog/2008/04#day06 |
| 09:46 | <jwalden> | oh man |
| 09:46 | <jwalden> | that looks AWESOME |
| 10:55 | <hsivonen> | othermaciej: switching to HeadBiasedSortedSet and TailBiasedSortedSet changed the comparison patterns for the better by approximately a factor of 29 |
| 10:56 | <hsivonen> | I don't know what kind of balanced tree TreeSet has, but it sure compares a *lot* |
| 10:57 | <othermaciej> | how big is your data set? |
| 10:57 | <hsivonen> | othermaciej: the HTML 5 spec |
| 10:57 | <hsivonen> | about 16000 items in the set |
| 10:58 | <othermaciej> | log base 2 of that is 14 |
| 10:58 | <othermaciej> | must not be that well balanced |
| 10:58 | <othermaciej> | (well, close to 14) |
| 10:58 | <hsivonen> | or it compares everything twice |
| 10:58 | <hsivonen> | or something |
| 10:59 | <hsivonen> | the next big hotspots are IO and XPath |
| 11:01 | <hsivonen> | making IO go away is hard, but making XPath go away is quite doable and something I want to do anyway |
| 11:02 | <othermaciej> | I am still somewhat surprised that inserting into a tree-based data structure could be the top hot spot for a program |
| 11:02 | <hsivonen> | doh brainfart re compares everything twice |
| 11:03 | <hsivonen> | anyway, the profiler data is what it is. |
| 11:08 | <othermaciej> | sorting an already-sorted 16000 element array in JavaScript takes 10 milliseconds |
| 11:08 | <othermaciej> | (on Safari on a decent machine) |
| 11:08 | <othermaciej> | I would suspect something is broken about TreeSet |
| 11:11 | <hsivonen> | the factor 29 wasn't a time factor but number of invocations of compareTo factor |
| 11:11 | <hsivonen> | so it can't be CPU timing weirdness with the profiler |
| 11:40 | <zcorpan> | 2137 entities |
| 11:41 | <zcorpan> | it's like learning simplified chinese |
| 11:42 | <hsivonen> | I think this entity business is a bad idea, but I haven't gotten around to sending mail yet |
| 11:44 | <hsivonen> | annevk: re blog: http://www.photobasement.com/wp-content/uploads/2008/04/quotationmarks.jpg |
| 11:46 | <zcorpan> | wiki syndrome? |
| 12:05 | <annevk> | zcorpan, fwiw, I think you should drop the strict error reporting thing from dom5 |
| 12:09 | <jruderman> | hsivonen: nice photo |
| 12:12 | <gsnedders> | Philip`: http://pastebin.ca/974108 |
| 12:16 | <zcorpan> | annevk: isn't that too late given acid3? |
| 12:17 | <zcorpan> | annevk: or do you mean just drop the attribute and leave the strict behavior intact? |
| 12:18 | <annevk> | what does acid3 have to do with anything? |
| 12:18 | <zcorpan> | it checks that createElementNS('...', 'foo::') raises an exception, e.g. |
| 12:21 | <annevk> | I meant the method on document, fwiw |
| 12:24 | <zcorpan> | ok |
| 12:24 | <zcorpan> | although it would be nice to be able to create an html5 parser in js |
| 12:24 | <hsivonen> | so here I was parsing XML |
| 12:24 | <hsivonen> | and it went really slowly |
| 12:24 | <annevk> | zcorpan, document.innerHTML |
| 12:25 | <hsivonen> | until I realized I should prevent it from fetching DTDs from w3.org... |
| 12:25 | <zcorpan> | annevk: true, the dom3core attribute doesn't help legacy browsers anyway |
| 12:27 | zcorpan | uses del.icio.us as his dom5core issue tracker |
| 12:40 | <annevk> | hmm, entity changes are impossible to track using web-apps-tracker |
| 13:05 | <Philip`> | gsnedders: How come so many people spell "connection" wrong, but get every other header correct? |
| 13:06 | <gsnedders> | Philip`: there are plenty of rare mistakes, though |
| 13:06 | <gsnedders> | Philip`: like spaces and not hyphens comes up a fair bit |
| 13:07 | <gsnedders> | Philip`: but cneonction is caused by a proxy, I can't remember which |
| 13:07 | <gsnedders> | Philip`: it was to avoid keeping the connection open, IIRC |
| 13:07 | <gsnedders> | Philip`: there was a bizarre reason for it |
| 13:08 | <Philip`> | Ah, that's what http://www.nextthing.org/archives/2005/08/07/fun-with-http-headers says |
| 13:08 | <gsnedders> | the web is weird. |
| 13:10 | <toruvinn> | gsnedders, my guess would be the 'neon proxy', there was something like that. |
| 13:11 | <Philip`> | ("... I had a database with 2,686,155 page responses and 23,699,737 response headers. The actual downloading of all of this took about a week." - that sounds really quite slow) |
| 13:11 | <toruvinn> | haha, awesome page, Philip`. |
| 13:11 | <toruvinn> | thanks. |
| 13:13 | <gsnedders> | how do you plot something using gnuplot from a data file taking the log of one column of data? |
| 13:13 | <Philip`> | gsnedders: "set logscale x 2" might be what you want |
| 13:15 | <gsnedders> | Philip`: no, y |
| 13:15 | <Philip`> | or "plot 'foo.dat' using 1:(log($2)) ..." might be |
| 13:15 | <gsnedders> | But that's good enough :) |
| 13:16 | <gsnedders> | it still shows a huge long tail |
| 13:20 | <Philip`> | Everything has a long tail :-) |
| 13:22 | <annevk> | from the blog: "Can we get OOXML in HTML5 too? They seem to be very similar in their approaches to standardisation." |
| 13:22 | <annevk> | wtf, really... |
| 13:32 | <Philip`> | annevk: You might need to be more specific than "the blog", since there are several |
| 13:33 | <annevk> | oops, s/the/my/ |
| 13:33 | <Philip`> | Ah, that narrows it down sufficiently |
| 13:33 | <zcorpan> | annevk's blog is *the* blog, didn't you know? |
| 13:33 | <gsnedders> | I mean, nobody reads my blog |
| 13:33 | <gsnedders> | Except, maybe, James Holderness (who I strongly suspect does) |
| 13:35 | <hsivonen> | hmm. V.nu parser perf sucks compared to Xerces |
| 13:35 | <hsivonen> | so badly that I suspect it is IO buffering and nothing in the algorithm |
| 13:38 | <Philip`> | Need more abstraction, so you can use the same IO buffering in both implementations |
| 13:42 | gsnedders | hopes that email is pointless sending |
| 13:43 | <gsnedders> | http://lists.w3.org/Archives/Public/ietf-http-wg/2008AprJun/0124.html |
| 13:44 | <gsnedders> | (or rather, I hope that sending that email isn't pointless) |
| 13:44 | <Philip`> | Count the number of emails that have been sent in the past day; calculate how much better the world is today than it was yesterday; divide; conclude that all emails are almost entirely useless, and so you should stop writing them |
| 13:46 | <gsnedders> | That would be a time-saver. |
| 13:47 | gsnedders | concludes he MUST reply to what a girl sent him ages ago |
| 13:47 | <gsnedders> | and apologise for being so damned slow. |
| 13:50 | gsnedders | can't believe he actually just used an RFC2119 term there |
| 13:50 | <hsivonen> | Philip`: neither parser pegs the CPU, btw, which also points to IO |
| 13:51 | <Philip`> | It could also point to Thread.sleep calls, but I assume you've avoided doing that |
| 13:56 | <annevk> | zcorpan, there's more than one blog? |
| 13:57 | <hsivonen> | heh. the hotspot in V.nu is isNcname |
| 13:57 | <hsivonen> | which wouldn't be needed if the DOM impl. accepted any element name |
| 13:58 | <annevk> | isNcname is becoming easier in a few months, I think |
| 13:59 | hsivonen | changes the test setup from DOM to SaxTree |
| 14:05 | <zcorpan> | SaxTree doesn't do such checks? |
| 14:07 | <hsivonen> | zcorpan: it doesn't |
| 14:10 | <hsivonen> | hmm. I'll just try SAX with defaulthandler |
| 14:15 | <hsivonen> | a java.util.regex-based isNcname is incredibly bad |
| 14:27 | <hsivonen> | looks like it's all about how often they go and read from the underlying FileInputStream |
| 14:31 | <hsivonen> | Xerces has special UTF-8 decoding... |
| 14:35 | <hsivonen> | OK. I have created a bug in my bytes to UTF-8 buffering |
| 14:37 | <hsivonen> | bytes to UTF-16 that is |
| 15:47 | <hsivonen> | Hixie: I can now confirm that not calling JDK intern() really makes a difference |
| 16:02 | <hsivonen> | Hixie: is this on your radar: https://bugzilla.mozilla.org/show_bug.cgi?id=427329#c7 |
| 17:09 | <annevk> | hsivonen, I don't think we should start special casing the parser for that |
| 17:09 | <annevk> | fwiw |
| 19:35 | <hsivonen> | annevk: btw, the NCName thing isn't getting any better per spec--only worse |
| 19:35 | <hsivonen> | annevk: the point of checking for NCNames is to avoid exceptions in existing software--not as much to comply with XML infosets |
| 19:37 | <annevk> | ah |
| 20:13 | <Hixie> | hsivonen: it's not clear to me that the parser is the problem |
| 20:13 | <hsivonen> | Hixie: it's claimed that backing out the parser fix helps |
| 20:14 | <andersca> | hey Hixie |
| 20:28 | <Hixie> | hsivonen: i thought it was claimed that it didn't |
| 20:28 | <Hixie> | oh, my bad |
| 20:28 | <Hixie> | misread it |
| 21:35 | <hsivonen> | http://typophile.com/node/43971 |
| 21:36 | <annevk> | I wonder if it's really embedding |
| 21:36 | annevk | was just reading that |
| 21:39 | <annevk> | http://lists.w3.org/Archives/Public/www-style/2007Dec/thread.html#msg84 |
| 21:48 | <Hixie> | ok wtf |
| 21:49 | <Hixie> | "REPORT /webapps/!svn/bc/1409/source HTTP/1.1" is taking up insane amounts of CPU on my box |
| 21:49 | <annevk> | maybe html5.org? |
| 21:50 | <jgraham> | Er, that would be me |
| 21:50 | <Hixie> | aha! |
| 21:50 | <Hixie> | the magic of irc |
| 21:50 | <jgraham> | I didn't realise it would take up CPU on your box |
| 21:50 | <Hixie> | jgraham: go ahead, it's ok |
| 21:50 | jgraham | is ignorant |
| 21:50 | <Hixie> | jgraham: i'm sure you have legitimate reasons for it :-) |
| 21:50 | <Hixie> | jgraham: just making sure it wasn't some runaway script or something |
| 21:51 | <Hixie> | what is REPORT, anyway? |
| 21:51 | <Hixie> | svn blame? |
| 21:51 | <jgraham> | I was just wondering why html5lib's EOF handling appears to be different to the spec |
| 21:51 | <jgraham> | Hixie: yep |
| 21:51 | <Hixie> | cool |
| 21:53 | <hsivonen> | I wonder why multiple ns doesn't go to XHTML5 validation: http://www.w3.org/2008/03/validators-chart |
| 21:55 | <Hixie> | i wonder why text/html with DTD doesn't go to (x)html5 validator |
| 21:56 | <Hixie> | in fact that whole thing is WAY more complex than necessary or desirable |
| 21:56 | <Hixie> | where does it come from? |
| 21:57 | <annevk> | W3C :) |
| 21:58 | <hsivonen> | Hixie: http://lists.w3.org/Archives/Public/www-tag/2008Apr/0017.html |
| 21:58 | <annevk> | http://lists.w3.org/Archives/Public/www-validator/2008Apr/0014.html ? |
| 21:58 | <annevk> | Web page study: http://nikitathespider.com/articles/ByTheNumbers/ |
| 21:59 | <Hixie> | ah |
| 21:59 | <hsivonen> | enabling NVDL in Validator.nu seems to be only a tiny bit of hacking away |
| 22:00 | <hsivonen> | once again there's some bad entity resolving that I need to fix |
| 22:07 | <hober> | this is awesome: http://nikitathespider.com/articles/ByTheNumbers/0803/MediaTypes.png |
| 22:07 | <Hixie> | looks basically like the numbers i got |
| 22:07 | <Hixie> | iirc i got 0.0044% to 0.2% depending on what kinds of pages i included |
| 22:08 | <Hixie> | (lower if i focused on the actively maintained web, higher if i included everything i could) |
| 22:09 | <Philip`> | I got 0.03% application/xhtml+xml from dmoz.org |
| 22:09 | <Philip`> | (and 99.8% text/html) |
| 22:10 | gsnedders | needs to get more HTTP headers :P |
| 22:10 | <Philip`> | gsnedders: Why? :-) |
| 22:10 | <gsnedders> | I mean, 1.1 million is nothing |
| 22:10 | <Philip`> | Depends on what you want to do with it |
| 22:10 | <gsnedders> | write a spec! :P |
| 22:11 | <Hixie> | i can never think of good examples for data-* |
| 22:11 | <Philip`> | People do real statistics with a sample size of hundreds - you don't always need billions :-) |
| 22:11 | <gsnedders> | Philip`: I know :) |
| 22:11 | <gsnedders> | Philip`: But it is all the edge cases that are helpful to have a large sample size for |
| 22:12 | <Philip`> | The web must be fractal, since you always find more edge cases when you look in more detail |
| 22:12 | <Hixie> | for writing the parser, i found that testing implementations was more useful than the data from the web |
| 22:12 | <Hixie> | but for defining new features, the data is invaluable |
| 22:13 | <Hixie> | i don't understand how people wrote specs without |
| 22:13 | <gsnedders> | Speaking of implementations, I need to email a guy at Opera |
| 22:13 | <gsnedders> | But I heart my left hand, and typig is slower than normal |
| 22:14 | <gsnedders> | typing, eve |
| 22:14 | <gsnedders> | *even |
| 22:14 | <Philip`> | Do you mean s/heart/hurt/ ? |
| 22:14 | <gsnedders> | yes |
| 22:15 | <Hixie> | Philip`, you're a braver man than i. i wasn't going to touch that one with a barge pole. |
| 22:15 | gsnedders | wonders what that one is |
| 22:15 | gsnedders | looks on the lists |
| 22:15 | <hober> | I imagine it was the s/// above |
| 22:15 | <gsnedders> | ah |
| 22:15 | <gsnedders> | oh dear… |
| 22:16 | <gsnedders> | now I realise… |
| 22:16 | <gsnedders> | I would say I ought to go hide in a corner because I didn't realise, but in this case, that's the wrong thing to say. |
| 22:19 | <gsnedders> | Hixie: Can I call you sick for just thinking of that? |
| 22:19 | <hober> | indeed. |
| 22:23 | <gsnedders> | Now, let me leave before I make an even more regrettable fuck up. |
| 22:33 | <Hixie> | annevk: what specs are you editor of these days? |
| 22:36 | <Hixie> | hey bloo |
| 22:36 | <blooberry> | hey hixie. 8-} |
| 22:36 | <Hixie> | wassup dude |
| 22:36 | <blooberry> | statistics. |
| 22:36 | <blooberry> | (trying to figure out how to present data and things) |
| 22:37 | <Hixie> | good times |
| 22:38 | <blooberry> | if you say so. ;-} *visions of standard deviations dancing through my head* |
| 22:41 | <Philip`> | Just say "the error bars are too small to show on this graph" |
| 22:41 | <blooberry> | I like that. 8-} |
| 22:41 | <Hixie> | hah |
| 22:42 | <andersca> | hey Hixie |
| 22:42 | <Hixie> | hey |
| 22:46 | <jgraham> | It's always good if you can claim the error bars aren't meaningful |
| 22:47 | <Hixie> | it's not at all clear to me what my error bars should actually be on some of my stats |
| 22:47 | <Hixie> | i mean, i can tell you exactly what the count was for the n billion pages |
| 22:47 | <Hixie> | it's not an estimate |
| 22:47 | <Hixie> | but since it's just a biased sample of an infinite number of pages... |
| 22:47 | <Hixie> | i don't know what to conclude |
| 22:48 | <jgraham> | Hixie: The error bars are supposed to represent the error on the population average based on the properties of your sample |
| 22:49 | <jgraham> | But since, as you note, you have a biased sample of the population it's not clear what that actually means |
| 22:49 | <Hixie> | so if n out of N pages had property X, what's the error on the population average for the property X? |
| 22:49 | <andersca> | Hixie: I have another application cache question for you |
| 22:49 | <Hixie> | go for it |
| 22:50 | <andersca> | Hixie: about the networking model |
| 22:50 | <andersca> | Hixie: so when a browsing context is associated with an application cache, all loads should go through the cache |
| 22:51 | <Hixie> | with the caveats defined in 4.6.5.1. Changes to the networking model, yes |
| 22:51 | <andersca> | yeah |
| 22:52 | <andersca> | now I understand that if I have a browsing context that is associated with an application cache |
| 22:52 | <andersca> | and the current document has a subframe, which is loaded from the cache |
| 22:52 | <andersca> | then that subframe's browsing context is not associated with an application cache? |
| 22:52 | <Hixie> | iirc the idea is that only the top-level browsing context matters |
| 22:53 | <Hixie> | but let me see if i can find that somewhere in the spec |
| 22:53 | <andersca> | the cache selection process will be invoked without a manifest URI for the subframe |
| 22:56 | <Hixie> | aha, found it |
| 22:56 | <Hixie> | "A child browsing context is always associated with the same browsing context as its parent browsing context, if any." |
| 22:56 | <Hixie> | from 4.6.2 Application caches |
| 22:57 | <jgraham> | Hixie: I think you just use the binomial std. deviation which is (pN(1-p))^0.5 |
| 22:57 | <Hixie> | jgraham: where p = n/N ? |
| 22:58 | <jgraham> | Hixie: Yeah |
| 22:58 | <Hixie> | jgraham: so sqrt(n*(1-(n/N))) ? |
| 22:58 | <jgraham> | Statistics is not something that I have done a lot of recently |
| 22:58 | <Philip`> | Hixie: sqrt(n*(n/N)*(1-n/N)), I think |
| 22:59 | <Philip`> | and then there's a 95% chance the population mean is within +/- 2 s.d. of the sample mean, I think |
| 22:59 | <Hixie> | so (pn(1-p))^0.5, not (pN(1-p))^0.5 |
| 23:00 | <annevk> | Hixie, http://wiki.whatwg.org/wiki/User:Annevk |
| 23:00 | <Philip`> | Hixie: Oops, I think I should have said sqrt(N*(n/N)*(1-n/N)) |
| 23:00 | <Hixie> | that's what jgraham said, right |
| 23:01 | <Hixie> | sqrt(n*(1-(n/N))) |
| 23:01 | <jgraham> | Yeah, that's what I said |
| 23:01 | <jgraham> | sqrt(N*(n/N)*(1-n/N)) that is |
| 23:01 | <Philip`> | Oh, simplifying the multiplications makes it more complex to see if it's right :-) |
| 23:01 | <jgraham> | or at least what I meant |
| 23:02 | Philip` | suggests it is a premature optimisation |
| 23:04 | <Hixie> | that graph can't be right |
| 23:04 | <Philip`> | (That calculation of s.d. only works if n/N is sufficiently non-extreme, like 20 < n < N-20 or something) |
| 23:09 | <Hixie> | y=sqrt(x(1-(x/N))) for N=1e9 from x=0..N results in a pretty curve that crosses the x axis at 0 and N and that peaks at about y=5e4 |
| 23:10 | <Hixie> | which seems unintuitive if y really represents the likely error |
| 23:10 | <Hixie> | at x |
| 23:12 | <Hixie> | wait, n and N are almost certainly not the n and N i was talking about here |
| 23:12 | <Hixie> | i'm guessing n is the sample size and N the population size |
| 23:12 | <Hixie> | in which case i can't work out the error, since the population size is infinite, or at least unknowable |
| 23:12 | <jgraham> | No, N should be the sample size |
| 23:12 | <Philip`> | n is the number with property P out of the sample size of N, and the population is assumed to be infinite |
| 23:12 | <Hixie> | oh |
| 23:12 | <Hixie> | well then |
| 23:12 | <Hixie> | something is wrong |
| 23:12 | <Hixie> | for this graph doesn't make sense |
| 23:12 | <jgraham> | (imagine flipping coins; the population is infinite then too) |
| 23:13 | <Hixie> | there's no way that if i find 1000 pages out of 1e9 that the error is less than if i find 10000 |
| 23:13 | <Hixie> | i guess it makes sense that the error would be symmetric |
| 23:14 | <Hixie> | about 50% |
| 23:14 | <Hixie> | since otherwise you could just define your problem as its reverse and your error would drop to zero |
| 23:14 | <Hixie> | but shouldn't the error for n 0.01% or n 99.99% be greater than for n 50%? |
| 23:15 | <Philip`> | It should peak at x=5e8, y=1.6e4, not at y=5e4, I think |
| 23:15 | <Hixie> | er yes, i meant 1.5e4 but the 1. was cut off on my display |
| 23:18 | <Hixie> | annevk: cool, thanks (re wiki page) |
| 23:20 | <Philip`> | Hixie: If you had a coin that gave heads 55% of the time, you wouldn't be surprised if it gave 50 heads out of 100 throws, because that's within expected random variation. But if you had a coin that gave heads 5% of the time, you would be surprised if you got 0 heads out of 100 throws (because the chance of that is 0.95^100 = 0.6%) |
| 23:21 | <Philip`> | So it's a 5% difference between sample and population means in both cases, but that's expected in the n/N=50 case and too extreme in the n/N=0 case |
| 23:21 | <Hixie> | fair enough |
| 23:21 | <Philip`> | so the expected variation is much lower nearer n=0 |
| 23:22 | <Hixie> | makes sense |
| 23:22 | Hixie | looks at the actual numbers |
| 23:22 | <Philip`> | (though the binomial normal approximation model breaks down when you actually get n=0) |
| 23:22 | <annevk> | Hixie, creating a new stats page? |
| 23:22 | <Hixie> | annevk: no, bloo made me think about it |
| 23:23 | <Hixie> | Philip`: so in a sample of 7e9 pages as my recent one, if i find 500 pages with a tag, that's really 500 +/- 22? |
| 23:23 | <Hixie> | i guess that makes sense |
| 23:23 | <jgraham> | Near n=0 it becomes poisson-like, right? So the error ~sqrt(n) |
| 23:24 | <Philip`> | Near n=0 I think you can just calculate the binomial directly, instead of approximating |
| 23:25 | <jgraham> | Right, but if you just want a good estimate and are lazy :) |
| 23:26 | <Philip`> | Hixie: 22 is the standard deviation, not the expected error - I think it's something like 68% chance that the sample mean is within +/- 1 s.d. of the true mean |
| 23:26 | <Philip`> | Hixie: so you want 2 s.d. (500 +/- 44) for 95% confidence |
| 23:26 | <Hixie> | ah right |
| 23:26 | <Philip`> | (I do hope I'm remembering this right...) |
| 23:27 | <Hixie> | this is all basically a complicated way of saying "we can't really tell anything for sure but we might as well assume it's all right" |
| 23:27 | <jgraham> | Philip`: The bit about std deviations is right |
| 23:27 | <Philip`> | The 95% thing means if you do this 20 times then you can expect to be wrong once, but hopefully only a little bit wrong :-) |
| 23:28 | <Hixie> | except i can't |
| 23:28 | <Hixie> | since i can't take a different sample |
| 23:28 | <jgraham> | But I _think_ you can make an estimate of the uncertainty on p using this method |
| 23:28 | <Hixie> | and i know the numbers precisely for my actual "sample" |
| 23:28 | <jgraham> | which is what you really care about |
| 23:28 | <Philip`> | Hixie: You should take a random sample of your 7e9 pages, and then you could do proper statistics on that, using the 7e9 as the population :-) |
| 23:29 | <Hixie> | that would be worthless |
| 23:29 | <Hixie> | since i can just do it on the whole thing! |
| 23:29 | <jgraham> | (like naievely you could say the probability of a page containing the tag is 500/7e9 +/- 22/7e9, only if might be more complicated than that) |
| 23:30 | <jgraham> | s/naievely/naively/ |
| 23:30 | <jgraham> | s/if/it/ |
| 23:30 | <Hixie> | if it's 500 +/- 44 out of N to have 95% confidence that the same proportion applies in the population as a whole |
| 23:30 | <Hixie> | that means that out of any random sample of N pages, there'll be 500-44 to 500+44 out of N that have this feature |
| 23:31 | <Hixie> | right? |
| 23:31 | <Hixie> | which is basically no error |
| 23:31 | <Hixie> | i mean, on the cosmic scale of things |
| 23:31 | <Philip`> | If they're random samples from the same infinitely large population (where "infinitely large" means "much larger than the sample size"), then yes |
| 23:31 | <jgraham> | http://en.wikipedia.org/wiki/Margin_of_error |
| 23:32 | <jgraham> | Philip`: Which brings me back to the point about it being good if you can say the error bars are meaningless |
| 23:32 | <Philip`> | (because obviously if sample size = population size then you'll find precisely 500 in any sample of size N, so you have to assume infinite population to make sure the samples are independent, I think) |
| 23:32 | <Hixie> | i'm going to continue pretending that the margin of error is as close to 0 as makes no difference so long as i find something on more than 10000 or so pages |
| 23:33 | <Hixie> | Philip`: yeah, the sad thing here is that the samples aren't at all random for me. They're the N most interesting pages, for some pretty precise and known-useful definition of interesting |
| 23:33 | <jgraham> | Hixie: Well if you can get a variation from 0.2-0.044 depending on which pages you sample you're dominated by systematic error anyway |
| 23:34 | <Hixie> | jgraham: exactly |
| 23:34 | <Philip`> | It seems unlikely that 456 vs 544 pages using some feature would have any practical significance on design decisions, which is all that really matters |
| 23:34 | <Hixie> | right |
| 23:38 | <Philip`> | Mostly it's just nice to not use too many decimal places when presenting data, like 1e4 out of 7e9 should be 0.00014% and not 0.0001429%, because meaningless decimal places remind me of physics lessons :-) |
| 23:39 | <Hixie> | yeah well in my case i have to round the data and add in some error anyway to keep the data from being too accurate |
| 23:39 | <Hixie> | so |
| 23:39 | <Hixie> | :-) |
| 23:39 | <Philip`> | How do you know if you're adding enough error? :-) |
| 23:40 | <Hixie> | i'm pretty sure i add enough |
| 23:40 | <Hixie> | and that's all i'll say about that :-P |
| 23:42 | <takkaria> | Hixie: where's the html5 svn repo viewer online? I can't seem to find it |
| 23:42 | <Hixie> | there's a link at the top of the spec |
| 23:43 | <takkaria> | that figures. :) ta |
| 23:44 | Philip` | is slightly reminded of Cryptonomicon, calculating exactly how much of the collected signals intelligence could be used before it would become sufficiently accurate that it would reveal its source |
| 23:44 | <Hixie> | yeah |
| 23:44 | <Philip`> | except this isn't quite as serious as a war |
| 23:44 | <Hixie> | indeed |
| 23:45 | <Hixie> | one of the things i do is report numbers for different characteristics from samples collected at different times |
| 23:45 | <Hixie> | so the numbers aren't self-consistent even if you try to combine them |
| 23:46 | <Hixie> | (they're close enough though) |
| 23:46 | <Hixie> | (to draw conclusions from for the spec, i mean) |
| 23:46 | <Philip`> | It's nice to work in areas that are trivial in the grand scheme of things, like HTML, so it doesn't matter when you mess up :-) |
| 23:47 | <Hixie> | yeah really |
| 23:47 | <Hixie> | we can have a big impact, but if we screw up, oh well! no biggie |
| 23:48 | <Philip`> | The internet is a demonstration that you can mess up quite a large number of things and we'll still carry on just fine |
| 23:52 | <Hixie> | aaah |
| 23:53 | <Hixie> | i broke mathml |
| 23:53 | <Hixie> | and didn't notice |
| 23:53 | <Hixie> | crap |
| 23:54 | <Hixie> | how do we handle <mglyph> |
| 23:54 | <fearphage> | http://files.myopera.com/fearphage/static/bugs.xhtml?http://my.imaginary/site/ this document was originally made and served as text/html. now it's served as application/xhtml+xml. can anyone tell me why #299801 (3rd test from the bottom) is failing and how to make it pass (if possible). the problem revolves around xml + document.evaluate with a null namespace |
| 23:54 | <fearphage> | is there a way to query xml nodes using xpath with a null namespace? |