00:57
<takkaria>
annevk: on your latest post, first para, s/totaled/totalled/ and s/amount/number/
00:59
<annevk>
ty, fixed
01:00
<annevk>
night all
01:00
annevk
-> bed
01:58
<annevk>
http://lists.w3.org/Archives/Public/public-xhtml2/2008Apr/0020.html
01:58
annevk
can't sleep (for those who read the earlier message)
02:12
<BenMillard>
it was snowing here yesterday morning...very unusual for this part of the UK
02:13
<BenMillard>
especially in April!
02:16
<Philip`>
Unfortunately it had all melted by the time I woke up (around 1pm) :-(
02:16
<BenMillard>
I took some photos...I suck at doing that, though they will appear on my blog eventually
02:21
<tomg>
it was nice
02:21
<tomg>
was shocked at how white it was
02:23
<BenMillard>
indeed, it was proper powder snow
02:23
<tomg>
only light snow was forecast
02:26
Hixie
e-mails the xhtml2 group
02:41
<MikeSmith>
BenMillard - can you give a brief summary of the goals of your ongoing table work?
02:41
<MikeSmith>
is there a particular hypothesis you're working under?
02:42
<BenMillard>
it's mostly to understand how tables get authored in reality, so we can design HTML5 features to be more realistic and robust
02:42
<BenMillard>
particularly the table header association mechanism
02:43
<BenMillard>
where "we" means HTMLWG in general
02:44
<BenMillard>
I should probably summarise the goals in the document :P
02:45
<MikeSmith>
BenMillard - yeah, that would be useful, I think
02:46
<MikeSmith>
summary of what you've done so far, and what else you want to do going forward
02:46
<MikeSmith>
what specific product/deliverable you have in mind
02:46
<BenMillard>
the document starts with a "Numbers" section which summarises what's been done
02:47
<MikeSmith>
for example, we could end up producing/publishing a W3C Note from it
02:47
<MikeSmith>
BenMillard - what's the URL?
02:47
<BenMillard>
oh, sorry I thought you were looking at it! it's here: http://sitesurgeon.co.uk/tables/
02:48
<BenMillard>
deliverables include helping develop the HTML5 header association algorithm with other HTMLWG participants (James Graham and Simon Pieters, mostly)
02:49
<MikeSmith>
BenMillard - hmm, seems like that wouldn't be a specific deliverable but instead something the deliverable/document could be used for
02:50
<MikeSmith>
have you gotten much feedback from WAI folks about it yet?
02:50
<BenMillard>
I don't recall talking to any WAI people about it yet
02:50
<Hixie>
http://www.w3.org/2003/entities/2007xml/unicode.xml has a last-modified date of 1970
02:50
<Hixie>
go w3c!
02:50
<MikeSmith>
heh
02:51
<BenMillard>
MikeSmith, it was done in my spare time so I've not been tracking feedback around it
02:51
Hixie
removes the -N from his command line so that his script will always download the file instead of assuming that it has an accurate date
02:52
<Hixie>
oh actually
02:52
<Hixie>
that could be a minefield bug
02:52
<Hixie>
with xslt
02:52
<Hixie>
nevermind
02:52
<MikeSmith>
Hixie - you mean the headers?
02:52
<MikeSmith>
Last-Modified: Sun, 06 Apr 2008 10:46:09
02:52
<Hixie>
yeah
02:52
<Hixie>
yeah, minefield bug with xslt
02:53
<Hixie>
lovely
02:53
<MikeSmith>
ah, OK
02:53
<MikeSmith>
go minefield!
02:54
<BenMillard>
MikeSmith, yes the document itself does not contain a product
02:54
<MikeSmith>
BenMillard - fwiw, I think it might be worthwhile to post the URL and short summary to one of the WAI lists
02:54
<MikeSmith>
if you've not done that already
02:55
<BenMillard>
it's not under active development at the moment; I can't afford the time
02:56
<BenMillard>
I'd like it to be used by anyone who finds it useful, though
02:56
MikeSmith
is getting lots of "Connection reset by peer http://intertwingly.net/blog/index.atom" when trying to get Sam's feed
02:56
<MikeSmith>
BenMillard - I see
02:57
<BenMillard>
I've been seeking sponsorship from various places to make the work more sustainable
02:57
<BenMillard>
I took the whole of last month off work to try and speed that up, but haven't quite got there
03:02
<MikeSmith>
BenMillard - I would think investing some time in doing a little awareness-raising about it might help with getting others to consider sponsoring further work
03:02
<MikeSmith>
e.g., by posting about it to WAI lists or elsewhere
03:03
<BenMillard>
MikeSmith, that's a good idea. can you suggest a specific list for me to post it to? I've only sent it to public-html before now.
03:05
<MikeSmith>
BenMillard - no, I can't suggest any specific list. I don't follow them. Karl Dubost might know
03:06
<MikeSmith>
might be worthwhile for you to get in touch with Michael Cooper and/or Shawn Henry
03:06
<MikeSmith>
both W3C staff for WAI
03:07
<BenMillard>
I've spoken to Shawn Henry in person in November 2007; I could ask her
03:10
<MikeSmith>
BenMillard - yeah, she would certainly be able to tell you
03:11
MikeSmith
drops off to head into office
03:11
<BenMillard>
MikeSmith, thanks for your advice about this. I'm a noob when it comes to W3C!
03:11
<BenMillard>
oh bugger, he left just before I sent that
03:35
<jwalden>
BenMillard: ^
03:38
<MikeSmith>
BenMillard - I'm a noob too.. I've worked for the W3C only for a year now. most people here actually have a lot more experience with the W3C than I do
03:40
<MikeSmith>
I'm sort of like a foreign element that's been inserted, and not clear yet if it's going to make things worse or better
03:42
<jwalden>
<foreignElement id="MikeSmith"/>
03:49
<BenMillard>
MikeSmith, that's cool
03:49
MikeSmith
reads dbaron's latest
03:50
<MikeSmith>
http://dbaron.org/log/20080406-acid3
03:50
<MikeSmith>
"Teaching to the Test" (on HTML5 and Acid3 and Firefox)
04:17
<BenMillard>
MikeSmith, I've updated the tables document: http://sitesurgeon.co.uk/tables/
04:18
<BenMillard>
away for lunch now (at 4am local time!)
05:22
<BenMillard>
MikeSmith, I've updated the tables document http://sitesurgeon.co.uk/tables/ is this better?
05:46
<MikeSmith>
BenMillard - which parts did you change/add?
05:46
<MikeSmith>
The Goals and Deliverables part?
05:46
<BenMillard>
that's right
05:47
<BenMillard>
I also moved the Feedback section to the top to emphasise the openness of its development
05:49
<BenMillard>
MikeSmith, using the UserName, Message format is more compatible with IRC clients and the #whatwg log than using UserName - Message
05:49
<BenMillard>
e.g.:
05:49
<BenMillard>
BenMillard, which parts did you change/add?
05:49
<MikeSmith>
um, thanks for the tip
05:50
<Hixie>
personally i'm more of a foo: bar fan than a foo, bar fan
05:51
<BenMillard>
either of those are compatible with IRC clients and the log, afaict, but foo - bar is not
05:52
<MikeSmith>
BenMillard - I don't know what you mean by compatible with IRC clients and the log
05:52
<MikeSmith>
not that I really care to get into a discussion about it
05:53
<BenMillard>
the highlight lines directed at you when either of the conventional methods are followed
05:53
<BenMillard>
*they
05:54
<BenMillard>
so if you address me as BenMillard, message it gets highlighted and I'm more likely to see it
05:54
<BenMillard>
(or BenMillard: message)
05:55
<Hixie>
either works with a decent irc client like irssi :-)
05:55
<MikeSmith>
BenMillard - yeah, I'd say that's problem between you and your IRC client
05:55
jwalden
snickers
05:56
<jwalden>
don't most clients get set off when your nick appears anywhere in the line, surrounded by \b ?
05:58
<BenMillard>
the , and : forms are the only ones I've found interoperable...is there a standard for this?
05:59
<jwalden>
standard? IRC? hah!
06:00
<jwalden>
or so I've been led to believe
06:02
<MikeSmith>
I think the standard for IRC is "please leave your %s at the door"
06:03
<MikeSmith>
jwalden - fwiw, XChat highlights the nick of the user who uses your nick in a message
06:03
<MikeSmith>
not your own nick
06:03
<MikeSmith>
which sorta makes sense to me
06:04
<MikeSmith>
I know my own name
06:04
<MikeSmith>
(most of the time)
06:04
<jwalden>
I used the phrase "set off" because it probably differs across clients; Chatzilla highlights the individual message, I assume others do other thigns
06:04
<MikeSmith>
the useful piece of information for this case seems to be, who's calling?
06:04
<jwalden>
s/gns/ngs/
06:06
<BenMillard>
the IRC logs for this channel are at http://krijnhoetmer.nl/irc-logs/ and it supports both the , and : forms. so if you use one of those forms, while in this channel, the highlighting feature will work in the logs
06:06
<BenMillard>
although I agree that any message which mentions a user's name should light up in that user's client
07:12
<hsivonen>
http://bitworking.org/news/317/Revisionist-XHistory
07:12
<othermaciej>
hi all
07:13
<hsivonen>
hi
07:27
<hsivonen>
surprisingly, one of the hotspots of Validator.nu is TreeSet doing a silly number of compares when the inserted items are already ordered or reverse-ordered.
07:28
<hsivonen>
too bad the JDK and Commons Collections don't seem to have head/tail-biased linked list-backed SortedSets.
07:29
<othermaciej>
that's not a good way to make a sorted data structure
07:30
<othermaciej>
(it's a good way to make an ordered associative data structure)
07:30
<hsivonen>
othermaciej: what's not good? TreeSet?
07:30
<othermaciej>
no, a linked-list backed set
07:30
<othermaciej>
a TreeSet (assuming it's a balanced tree) is a good way to make a balanced associative structure
07:31
<othermaciej>
but inserting N items into it should be N log N so it's surprising it would be a hot spot
07:31
<hsivonen>
othermaciej: why not if you know the insertion will always be either to head of the list or the next from head?
07:31
<othermaciej>
if you have items that are already ordered, you would want to use a ListHashSet
07:31
<othermaciej>
(I think that's what Java calls it)
07:32
<hsivonen>
othermaciej: the insertions are *almost* ordered
07:32
<hsivonen>
that is, the new insertion is most often to the head of the list, but sometimes a slot or two further
07:33
<hsivonen>
(LinkedHashSet is not what I need here)
07:33
<othermaciej>
that doesn't have an insertBefore?
07:33
<othermaciej>
(too bad, it should be easy to do)
07:35
<hsivonen>
it appears it does not
07:39
<hsivonen>
the show source feature ends up comparing locations 29 times the number of location objects
07:39
<hsivonen>
that's not good
08:08
<BenMillard>
annevk, you wrote "heh, RDF fanatics use " and <em> for quotes" and I see things like that quite frequently on the blogs of markup/accessibility enthusiasts/experts
08:12
<BenMillard>
indeed, it's hard to find anyone using <q>...present company excepted :)
08:13
<BenMillard>
getting the right punctuation seems more important to authors than using the right element
08:17
<jwalden>
problem being q's quotation behavior (CSS's, that is) is underspecified, as dbaron tells me
08:18
<BenMillard>
Hixie, I made editorial changes to http://sitesurgeon.co.uk/tables/ which include clarifying the markup used by each group in "How Authors Indicate Headers in Data Tables". They were a bit vague before. Let me know if this changes anything.
08:18
<Hixie>
k
08:19
<Hixie>
i probably won't look at table stuff for some time
09:09
zcorpan_
did not know about Document.strictErrorChecking
09:10
zcorpan_
will define in dom5core what happens when it is false
09:24
<BenMillard>
Philip`, my (badly taken) photos of UK snow are now blogged: http://projectcerbera.com/blog/2008/04#day06
09:46
<jwalden>
oh man
09:46
<jwalden>
that looks AWESOME
10:55
<hsivonen>
othermaciej: switching to HeadBiasedSortedSet and TailBiasedSortedSet changed the comparison patterns for the better by approximately a factor of 29
10:56
<hsivonen>
I don't know what kind of balanced tree TreeSet has, but it sure compares a *lot*
10:57
<othermaciej>
how big is your data set?
10:57
<hsivonen>
othermaciej: the HTML 5 spec
10:57
<hsivonen>
about 16000 items in the set
10:58
<othermaciej>
log base 2 of that is 14
10:58
<othermaciej>
must not be that well balanced
10:58
<othermaciej>
(well, close to 14)
10:58
<hsivonen>
or it compares everything twice
10:58
<hsivonen>
or something
10:59
<hsivonen>
the next big hotspots are IO and XPath
11:01
<hsivonen>
making IO go away is hard, but making XPath go away is quite doable and something I want to do anyway
11:02
<othermaciej>
I am still somewhat surprised that inserting into a tree-based data structure could be the top hot spot for a program
11:02
<hsivonen>
doh brainfart re compares everything twice
11:03
<hsivonen>
anyway, the profiler data is what it is.
11:08
<othermaciej>
sorting an already-sorted 16000 element array in JavaScript takes 10 milliseconds
11:08
<othermaciej>
(on Safari on a decent machine)
11:08
<othermaciej>
I would suspect something is broken about TreeSet
11:11
<hsivonen>
the factor 29 wasn't a time factor but number of invocations of compareTo factor
11:11
<hsivonen>
so it can't be CPU timing weirdness with the profiler
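The trade-off behind this thread can be sketched with a comparison counter: binary-search (or balanced-tree) insertion pays ~log2(n) compares per item even when the input is already reverse-ordered, while a head-biased structure pays ~1. This is only a toy Python illustration of the idea; the class and function names are made up here, not hsivonen's actual HeadBiasedSortedSet.

```python
import bisect

calls = 0  # global comparison counter


class Key:
    """Wraps a value and counts how many times it gets compared."""
    def __init__(self, v):
        self.v = v

    def __lt__(self, other):
        global calls
        calls += 1
        return self.v < other.v


def tree_like_insert_all(items):
    """Binary-search insertion: ~log2(n) compares per item, roughly the
    comparison pattern a balanced tree such as TreeSet exhibits even for
    already-ordered input."""
    global calls
    calls = 0
    out = []
    for v in items:
        bisect.insort(out, Key(v))
    return calls


def head_biased_insert_all(items):
    """Check the head first, falling back to binary search; for input that
    almost always belongs at the head this costs ~1 compare per item."""
    global calls
    calls = 0
    out = []
    for v in items:
        k = Key(v)
        if not out or k < out[0]:
            out.insert(0, k)
        else:
            bisect.insort(out, k)
    return calls
```

On a few thousand reverse-ordered items the binary-search version does roughly an order of magnitude more compares than the head-biased one, the kind of gap being discussed, though TreeSet's exact constant factors will differ.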
11:40
<zcorpan>
2137 entities
11:41
<zcorpan>
it's like learning simplified chinese
11:42
<hsivonen>
I think this entity business is a bad idea, but I haven't gotten around to sending mail yet
11:44
<hsivonen>
annevk: re blog: http://www.photobasement.com/wp-content/uploads/2008/04/quotationmarks.jpg
11:46
<zcorpan>
wiki syndrome?
12:05
<annevk>
zcorpan, fwiw, I think you should drop the strict error reporting thing from dom5
12:09
<jruderman>
hsivonen: nice photo
12:12
<gsnedders>
Philip`: http://pastebin.ca/974108
12:16
<zcorpan>
annevk: isn't that too late given acid3?
12:17
<zcorpan>
annevk: or do you mean just drop the attribute and leave the strict behavior intact?
12:18
<annevk>
what does acid3 have to do with anything?
12:18
<zcorpan>
it checks that createElementNS('...', 'foo::') raises an exception, e.g.
12:21
<annevk>
I meant the method on document, fwiw
12:24
<zcorpan>
ok
12:24
<zcorpan>
although it would be nice to be able to create an html5 parser in js
12:24
<hsivonen>
so here I was parsing XML
12:24
<hsivonen>
and it went really slowly
12:24
<annevk>
zcorpan, document.innerHTML
12:25
<hsivonen>
until I realized I should prevent it from fetching DTDs from w3.org...
12:25
<zcorpan>
annevk: true, the dom3core attribute doesn't help legacy browsers anyway
12:27
zcorpan
uses del.icio.us as his dom5core issue tracker
12:40
<annevk>
hmm, entity changes are impossible to track using web-apps-tracker
13:05
<Philip`>
gsnedders: How come so many people spell "connection" wrong, but get every other header correct?
13:06
<gsnedders>
Philip`: there are plenty of rare mistakes, though
13:06
<gsnedders>
Philip`: like spaces and not hyphens comes up a fair bit
13:07
<gsnedders>
Philip`: but cneonction is caused by a proxy, I can't remember which
13:07
<gsnedders>
Philip`: it was to avoid keeping the connection open, IIRC
13:07
<gsnedders>
Philip`: there was a bizarre reason for it
13:08
<Philip`>
Ah, that's what http://www.nextthing.org/archives/2005/08/07/fun-with-http-headers says
13:08
<gsnedders>
the web is weird.
13:10
<toruvinn>
gsnedders, my guess would be the 'neon proxy', there was something like that.
13:11
<Philip`>
("... I had a database with 2,686,155 page responses and 23,699,737 response headers. The actual downloading of all of this took about a week." - that sounds really quite slow)
13:11
<toruvinn>
haha, awesome page, Philip`.
13:11
<toruvinn>
thanks.
13:13
<gsnedders>
how do you plot something using gnuplot from a data file taking the log of one column of data?
13:13
<Philip`>
gsnedders: "set logscale x 2" might be what you want
13:15
<gsnedders>
Philip`: no, y
13:15
<Philip`>
or "plot 'foo.dat' using 1:(log($2)) ..." might be
13:15
<gsnedders>
But that's good enough :)
13:16
<gsnedders>
it still shows a huge long tail
13:20
<Philip`>
Everything has a long tail :-)
13:22
<annevk>
from the blog: "Can we get OOXML in HTML5 too? They seem to be very similar in their approaches to standardisation."
13:22
<annevk>
wtf, really...
13:32
<Philip`>
annevk: You might need to be more specific than "the blog", since there are several
13:33
<annevk>
oops, s/the/my/
13:33
<Philip`>
Ah, that narrows it down sufficiently
13:33
<zcorpan>
annevk's blog is *the* blog, didn't you know?
13:33
<gsnedders>
I mean, nobody reads my blog
13:33
<gsnedders>
Except, maybe, James Holderness (who I strongly suspect does)
13:35
<hsivonen>
hmm. V.nu parser perf sucks compared to Xerces
13:35
<hsivonen>
so badly that I suspect it is IO buffering and nothing in the algorithm
13:38
<Philip`>
Need more abstraction, so you can use the same IO buffering in both implementations
13:42
gsnedders
hopes that email is pointless sending
13:43
<gsnedders>
http://lists.w3.org/Archives/Public/ietf-http-wg/2008AprJun/0124.html
13:44
<gsnedders>
(or rather, I hope that sending that email isn't pointless)
13:44
<Philip`>
Count the number of emails that have been sent in the past day; calculate how much better the world is today than it was yesterday; divide; conclude that all emails are almost entirely useless, and so you should stop writing them
13:46
<gsnedders>
That would be a time-saver.
13:47
gsnedders
concludes he MUST reply to what a girl sent him ages ago
13:47
<gsnedders>
and apologise for being so damned slow.
13:50
gsnedders
can't believe he actually just used an RFC2119 term there
13:50
<hsivonen>
Philip`: neither parser pegs the CPU, btw, which also points to IO
13:51
<Philip`>
It could also point to Thread.sleep calls, but I assume you've avoided doing that
13:56
<annevk>
zcorpan, there's more than one blog?
13:57
<hsivonen>
heh. the hotspot in V.nu is isNcname
13:57
<hsivonen>
which wouldn't be needed if the DOM impl. accepted any element name
13:58
<annevk>
isNcname is becoming easier in a few months, I think
13:59
hsivonen
changes the test setup from DOM to SaxTree
14:05
<zcorpan>
SaxTree doesn't do such checks?
14:07
<hsivonen>
zcorpan: it doesn't
14:10
<hsivonen>
hmm. I'll just try SAX with defaulthandler
14:15
<hsivonen>
a java.util.regex-based isNcname is incredibly bad
14:27
<hsivonen>
looks like it's all about how often they go and read from the underlying FileInputStream
14:31
<hsivonen>
Xerces has special UTF-8 decoding...
14:35
<hsivonen>
OK. I have created a bug in my bytes to UTF-8 buffering
14:37
<hsivonen>
bytes to UTF-16 that is
15:47
<hsivonen>
Hixie: I can now confirm that not calling JDK intern() really makes a difference
16:02
<hsivonen>
Hixie: is this on your radar: https://bugzilla.mozilla.org/show_bug.cgi?id=427329#c7
17:09
<annevk>
hsivonen, I don't think we should start special casing the parser for that
17:09
<annevk>
fwiw
19:35
<hsivonen>
annevk: btw, the NCName thing isn't getting any better per spec--only worse
19:35
<hsivonen>
annevk: the point of checking for NCNames is to avoid exceptions in existing software--not as much to comply with XML infosets
19:37
<annevk>
ah
20:13
<Hixie>
hsivonen: it's not clear to me that the parser is the problem
20:13
<hsivonen>
Hixie: it's claimed that backing out the parser fix helps
20:14
<andersca>
hey Hixie
20:28
<Hixie>
hsivonen: i thought it was claimed that it didn't
20:28
<Hixie>
oh, my bad
20:28
<Hixie>
misread it
21:35
<hsivonen>
http://typophile.com/node/43971
21:36
<annevk>
I wonder if it's really embedding
21:36
annevk
was just reading that
21:39
<annevk>
http://lists.w3.org/Archives/Public/www-style/2007Dec/thread.html#msg84
21:48
<Hixie>
ok wtf
21:49
<Hixie>
"REPORT /webapps/!svn/bc/1409/source HTTP/1.1" is taking up insane amounts of CPU on my box
21:49
<annevk>
maybe html5.org?
21:50
<jgraham>
Er, that would be me
21:50
<Hixie>
aha!
21:50
<Hixie>
the magic of irc
21:50
<jgraham>
I didn't realise it would take up CPU on your box
21:50
<Hixie>
jgraham: go ahead, it's ok
21:50
jgraham
is ignorant
21:50
<Hixie>
jgraham: i'm sure you have legitimate reasons for it :-)
21:50
<Hixie>
jgraham: just making sure it wasn't some runaway script or something
21:51
<Hixie>
what is REPORT, anyway?
21:51
<Hixie>
svn blame?
21:51
<jgraham>
I was just wondering why html5lib's EOF handling appears to be different to the spec
21:51
<jgraham>
Hixie: yep
21:51
<Hixie>
cool
21:53
<hsivonen>
I wonder why multiple ns doesn't go to XHTML5 validation: http://www.w3.org/2008/03/validators-chart
21:55
<Hixie>
i wonder why text/html with DTD doesn't go to (x)html5 validator
21:56
<Hixie>
in fact that whole thing is WAY more complex than necessary or desirable
21:56
<Hixie>
where does it come from?
21:57
<annevk>
W3C :)
21:58
<hsivonen>
Hixie: http://lists.w3.org/Archives/Public/www-tag/2008Apr/0017.html
21:58
<annevk>
http://lists.w3.org/Archives/Public/www-validator/2008Apr/0014.html ?
21:58
<annevk>
Web page study: http://nikitathespider.com/articles/ByTheNumbers/
21:59
<Hixie>
ah
21:59
<hsivonen>
enabling NVDL in Validator.nu seems to be only a tiny bit of hacking away
22:00
<hsivonen>
once again there's some bad entity resolving that I need to fix
22:07
<hober>
this is awesome: http://nikitathespider.com/articles/ByTheNumbers/0803/MediaTypes.png
22:07
<Hixie>
looks basically like the numbers i got
22:07
<Hixie>
iirc i got 0.0044% to 0.2% depending on what kinds of pages i included
22:08
<Hixie>
(lower if i focused on the actively maintained web, higher if i included everything i could)
22:09
<Philip`>
I got 0.03% application/xhtml+xml from dmoz.org
22:09
<Philip`>
(and 99.8% text/html)
22:10
gsnedders
needs to get more HTTP headers :P
22:10
<Philip`>
gsnedders: Why? :-)
22:10
<gsnedders>
I mean, 1.1 million is nothing
22:10
<Philip`>
Depends on what you want to do with it
22:10
<gsnedders>
write a spec! :P
22:11
<Hixie>
i can never think of good examples for data-*
22:11
<Philip`>
People do real statistics with a sample size of hundreds - you don't always need billions :-)
22:11
<gsnedders>
Philip`: I know :)
22:11
<gsnedders>
Philip`: But it is all the edge cases that are helpful to have a large sample size for
22:12
<Philip`>
The web must be fractal, since you always find more edge cases when you look in more detail
22:12
<Hixie>
for writing the parser, i found that testing implementations was more useful than the data from the web
22:12
<Hixie>
but for defining new features, the data is invaluable
22:13
<Hixie>
i don't understand how people wrote specs without
22:13
<gsnedders>
Speaking of implementations, I need to email a guy at Opera
22:13
<gsnedders>
But I heart my left hand, and typig is slower than normal
22:14
<gsnedders>
typing, eve
22:14
<gsnedders>
*even
22:14
<Philip`>
Do you mean s/heart/hurt/ ?
22:14
<gsnedders>
yes
22:15
<Hixie>
Philip`, you're a braver man than i. i wasn't going to touch that one with a barge pole.
22:15
gsnedders
wonders what that one is
22:15
gsnedders
looks on the lists
22:15
<hober>
I imagine it was the s/// above
22:15
<gsnedders>
ah
22:15
<gsnedders>
oh dear…
22:16
<gsnedders>
now I realise…
22:16
<gsnedders>
I would say I ought to go hide in a corner because I didn't realise, but in this case, that's the wrong thing to say.
22:19
<gsnedders>
Hixie: Can I call you sick for just thinking of that?
22:19
<hober>
indeed.
22:23
<gsnedders>
Now, let me leave before I make an even more regrettable fuck up.
22:33
<Hixie>
annevk: what specs are you editor of these days?
22:36
<Hixie>
hey bloo
22:36
<blooberry>
hey hixie. 8-}
22:36
<Hixie>
wassup dude
22:36
<blooberry>
statistics.
22:36
<blooberry>
(trying to figure out how to present data and things)
22:37
<Hixie>
good times
22:38
<blooberry>
if you say so. ;-} *visions of standard deviations dancing through my head*
22:41
<Philip`>
Just say "the error bars are too small to show on this graph"
22:41
<blooberry>
I like that. 8-}
22:41
<Hixie>
hah
22:42
<andersca>
hey Hixie
22:42
<Hixie>
hey
22:46
<jgraham>
It's always good if you can claim the error bars aren't meaningful
22:47
<Hixie>
it's not at all clear to me what my error bars should actually be on some of my stats
22:47
<Hixie>
i mean, i can tell you exactly what the count was for the n billion pages
22:47
<Hixie>
it's not an estimate
22:47
<Hixie>
but since it's just a biased sample of an infinite number of pages...
22:47
<Hixie>
i don't know what to conclude
22:48
<jgraham>
Hixie: The error bars are supposed to represent the error on the population average based on the properties of your sample
22:49
<jgraham>
But since, as you note, you have a biased sample of the population it's not clear what that actually means
22:49
<Hixie>
so if n out of N pages had property X, what's the error on the population average for the property X?
22:49
<andersca>
Hixie: I have another application cache question for you
22:49
<Hixie>
go for it
22:50
<andersca>
Hixie: about the networking model
22:50
<andersca>
Hixie: so when a browsing context is associated with an application cache, all loads should go through the cache
22:51
<Hixie>
with the caveats defined in 4.6.5.1. Changes to the networking model, yes
22:51
<andersca>
yeah
22:52
<andersca>
now I understand that if I have a browsing context that is associated with an application cache
22:52
<andersca>
and the current document has a subframe, which is loaded from the cache
22:52
<andersca>
then that subframe's browsing context is not associated with an application cache?
22:52
<Hixie>
iirc the idea is that only the top-level browsing context matters
22:53
<Hixie>
but let me see if i can find that somewhere in the spec
22:53
<andersca>
the cache selection process will be invoked without a manifest URI for the subframe
22:56
<Hixie>
aha, found it
22:56
<Hixie>
"A child browsing context is always associated with the same browsing context as its parent browsing context, if any."
22:56
<Hixie>
from 4.6.2 Application caches
22:57
<jgraham>
Hixie: I think you just use the binomial std. deviation which is (pN(1-p))^0.5
22:57
<Hixie>
jgraham: where p = n/N ?
22:58
<jgraham>
Hixie: Yeah
22:58
<Hixie>
jgraham: so sqrt(n*(1-(n/N))) ?
22:58
<jgraham>
Statistics is not something that I have done a lot of recently
22:58
<Philip`>
Hixie: sqrt(n*(n/N)*(1-n/N)), I think
22:59
<Philip`>
and then there's a 95% chance the population mean is within +/- 2 s.d. of the sample mean, I think
22:59
<Hixie>
so (pn(1-p))^0.5, not (pN(1-p))^0.5
23:00
<annevk>
Hixie, http://wiki.whatwg.org/wiki/User:Annevk
23:00
<Philip`>
Hixie: Oops, I think I should have said sqrt(N*(n/N)*(1-n/N))
23:00
<Hixie>
that's what jgraham said, right
23:01
<Hixie>
sqrt(n*(1-(n/N)))
23:01
<jgraham>
Yeah, that's what I said
23:01
<jgraham>
sqrt(N*(n/N)*(1-n/N)) that is
23:01
<Philip`>
Oh, simplifying the multiplications makes it more complex to see if it's right :-)
23:01
<jgraham>
or at least what I meant
23:02
Philip`
suggests it is a premature optimisation
23:04
<Hixie>
that graph can't be right
23:04
<Philip`>
(That calculation of s.d. only works if n/N is sufficiently non-extreme, like 20 < n < N-20 or something)
23:09
<Hixie>
y=sqrt(x(1-(x/N))) for N=1e9 from x=0..N results in a pretty curve that crosses the x axis at 0 and N and that peaks at about y=5e4
23:10
<Hixie>
which seems unintuitive if y really represents the likely error
23:10
<Hixie>
at x
23:12
<Hixie>
wait, n and N are almost certainly not the n and N i was talking about here
23:12
<Hixie>
i'm guessing n is the sample size and N the population size
23:12
<Hixie>
in which case i can't work out the error, since the population size is infinite, or at least unknowable
23:12
<jgraham>
No, N should be the sample size
23:12
<Philip`>
n is the number with property P out of the sample size of N, and the population is assumed to be infinite
23:12
<Hixie>
oh
23:12
<Hixie>
well then
23:12
<Hixie>
something is wrong
23:12
<Hixie>
for this graph doesn't make sense
23:12
<jgraham>
(imagine flipping coins; the population is infinite then too)
23:13
<Hixie>
there's no way that if i find 1000 pages out of 1e9 that the error is less than if i find 10000
23:13
<Hixie>
i guess it makes sense that the error would be symmetric
23:14
<Hixie>
about 50%
23:14
<Hixie>
since otherwise you could just define your problem as its reverse and your error would drop to zero
23:14
<Hixie>
but shouldn't the error for n 0.01% or n 99.99% be greater than for n 50%?
23:15
<Philip`>
It should peak at x=5e8, y=1.6e4, not at y=5e4, I think
23:15
<Hixie>
er yes, i meant 1.5e4 but the 1. was cut off on my display
23:18
<Hixie>
annevk: cool, thanks (re wiki page)
23:20
<Philip`>
Hixie: If you had a coin that gave heads 55% of the time, you wouldn't be surprised if it gave 50 heads out of 100 throws, because that's within expected random variation. But if you had a coin that gave heads 5% of the time, you would be surprised if you got 0 heads out of 100 throws (because the chance of that is 0.95^100 = 0.6%)
23:21
<Philip`>
So it's a 5% difference between sample and population means in both cases, but that's expected in the n/N=50 case and too extreme in the n/N=0 case
23:21
<Hixie>
fair enough
23:21
<Philip`>
so the expected variation is much lower nearer n=0
23:22
<Hixie>
makes sense
23:22
Hixie
looks at the actual numbers
23:22
<Philip`>
(though the binomial normal approximation model breaks down when you actually get n=0)
23:22
<annevk>
Hixie, creating a new stats page?
23:22
<Hixie>
annevk: no, bloo made me think about it
23:23
<Hixie>
Philip`: so in a sample of 7e9 pages as my recent one, if i find 500 pages with a tag, that's really 500 +/- 22?
23:23
<Hixie>
i guess that makes sense
23:23
<jgraham>
Near n=0 it becomes poisson-like, right? So the error ~sqrt(n)
23:24
<Philip`>
Near n=0 I think you can just calculate the binomial directly, instead of approximating
23:25
<jgraham>
Right, but if you just want a good estimate and are lazy :)
23:26
<Philip`>
Hixie: 22 is the standard deviation, not the expected error - I think it's something like 66% chance that the sample mean is within +/- 1 s.d. of the true mean
23:26
<Philip`>
Hixie: so you want 2 s.d. (500 +/- 44) for 95% confidence
23:26
<Hixie>
ah right
23:26
<Philip`>
(I do hope I'm remembering this right...)
23:27
<Hixie>
this is all basically a complicated way of saying "we can't really tell anything for sure but we might as well assume it's all right"
23:27
<jgraham>
Philip`: The bit about std deviations is right
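The formula the discussion settles on, sqrt(N * (n/N) * (1 - n/N)), can be checked numerically against Hixie's 500-out-of-7e9 example. A minimal Python sketch (the function name here is ours, not from the discussion):

```python
import math


def binomial_sd(n, N):
    """Standard deviation of a count n out of N independent trials, with
    the success probability estimated as p = n/N: sqrt(N * p * (1 - p))."""
    p = n / N
    return math.sqrt(N * p * (1 - p))


# Hixie's case: 500 pages out of a 7e9-page sample
sd = binomial_sd(500, 7e9)             # about 22.4, since p is tiny
ci95 = (500 - 2 * sd, 500 + 2 * sd)    # roughly 455..545, the ~95% interval
```

Because p is so small here, the result is essentially sqrt(n) = sqrt(500) ≈ 22.4, matching the "500 +/- 22" (one s.d.) and "500 +/- 44" (two s.d., ~95%) figures quoted above.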
23:27
<Philip`>
The 95% thing means if you do this 20 times then you can expect to be wrong once, but hopefully only a little bit wrong :-)
23:28
<Hixie>
except i can't
23:28
<Hixie>
since i can't take a different sample
23:28
<jgraham>
But I _think_ you can make an estimate of the uncertainty on p using this method
23:28
<Hixie>
and i know the numbers precisely for my actual "sample"
23:28
<jgraham>
which is what you really care about
23:28
<Philip`>
Hixie: You should take a random sample of your 7e9 pages, and then you could do proper statistics on that, using the 7e9 as the population :-)
23:29
<Hixie>
that would be worthless
23:29
<Hixie>
since i can just do it on the whole thing!
23:29
<jgraham>
(like naievely you could say the probability of a page containing the tag is 500/7e9 +/- 22/7e9, only if might be more complicated than that)
23:30
<jgraham>
s/naievely/naively/
23:30
<jgraham>
s/if/it/
23:30
<Hixie>
if it's 500 +/- 44 out of N to have 95% confidence that the same proportion applies in the population as a whole
23:30
<Hixie>
that means that out of any random sample of N pages, there'll be 500-44 to 500+44 out of N that have this feature
23:31
<Hixie>
right?
23:31
<Hixie>
which is basically no error
23:31
<Hixie>
i mean, on the cosmic scale of things
23:31
<Philip`>
If they're random samples from the same infinitely large population (where "infinitely large" means "much larger than the sample size"), then yes
23:31
<jgraham>
http://en.wikipedia.org/wiki/Margin_of_error
23:32
<jgraham>
Philip`: Which brings me back to the point about it being good if you can say the error bars are meaningless
23:32
<Philip`>
(because obviously if sample size = population size then you'll find precisely 500 in any sample of size N, so you have to assume infinite population to make sure the samples are independent, I think)
23:32
<Hixie>
i'm going to continue pretending that the margin of error is as close to 0 as makes no difference so long as i find something on more than 10000 or so pages
23:33
<Hixie>
Philip`: yeah, the sad thing here is that the samples aren't at all random for me. They're the N most interesting pages, for some pretty precise and known-useful definition of interesting
23:33
<jgraham>
Hixie: Well if you can get a variation from 0.2-0.044 depending on which pages you sample you're dominated by systematic error anyway
23:34
<Hixie>
jgraham: exactly
23:34
<Philip`>
It seems unlikely that 456 vs 544 pages using some feature would have any practical significance on design decisions, which is all that really matters
23:34
<Hixie>
right
23:38
<Philip`>
Mostly it's just nice to not use too many decimal places when presenting data, like 1e4 out of 7e9 should be 0.00014% and not 0.0001429%, because meaningless decimal places remind me of physics lessons :-)
23:39
<Hixie>
yeah well in my case i have to round the data and add in some error anyway to keep the data from being too accurate
23:39
<Hixie>
so
23:39
<Hixie>
:-)
23:39
<Philip`>
How do you know if you're adding enough error? :-)
23:40
<Hixie>
i'm pretty sure i add enough
23:40
<Hixie>
and that's all i'll say about that :-P
23:42
<takkaria>
Hixie: where's the html5 svn repo viewer online? I can't seem to find it
23:42
<Hixie>
there's a link at the top of the spec
23:43
<takkaria>
that figures. :) ta
23:44
Philip`
is slightly reminded of Cryptonomicon, calculating exactly how much of the collected signals intelligence could be used before it would become sufficiently accurate that it would reveal its source
23:44
<Hixie>
yeah
23:44
<Philip`>
except this isn't quite as serious as a war
23:44
<Hixie>
indeed
23:45
<Hixie>
one of the things i do is report numbers for different characteristics from samples collected at different times
23:45
<Hixie>
so the numbers aren't self-consistent even if you try to combine them
23:46
<Hixie>
(they're close enough though)
23:46
<Hixie>
(to draw conclusions from for the spec, i mean)
23:46
<Philip`>
It's nice to work in areas that are trivial in the grand scheme of things, like HTML, so it doesn't matter when you mess up :-)
23:47
<Hixie>
yeah really
23:47
<Hixie>
we can have a big impact, but if we screw up, oh well! no biggie
23:48
<Philip`>
The internet is a demonstration that you can mess up quite a large number of things and we'll still carry on just fine
23:52
<Hixie>
aaah
23:53
<Hixie>
i broke mathml
23:53
<Hixie>
and didn't notice
23:53
<Hixie>
crap
23:54
<Hixie>
how do we handle <mglyph>
23:54
<fearphage>
http://files.myopera.com/fearphage/static/bugs.xhtml?http://my.imaginary/site/ this document was originally made and served as text/html. now it's served as application/xhtml+xml. can anyone tell me why #299801 (3rd test from the bottom) is failing and how to make it pass (if possible). the problem revolves around xml + document.evaluate with a null namespace
23:54
<fearphage>
is there a way to query xml nodes using xpath with a null namespace?
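The usual answer to fearphage's question: in XPath 1.0 an unprefixed name in the expression always means "no namespace", so elements in the XHTML namespace can never match it; with document.evaluate you have to pass a namespace resolver (the third argument) that binds some prefix, any prefix, to http://www.w3.org/1999/xhtml, and write //h:p instead of //p. The same rule can be sketched with Python's ElementTree rather than the browser API (a minimal illustration, not fearphage's actual document):

```python
import xml.etree.ElementTree as ET

XHTML = 'http://www.w3.org/1999/xhtml'

doc = ET.fromstring(
    '<html xmlns="%s"><body><p>hi</p></body></html>' % XHTML
)

# An unprefixed 'p' in the expression means a no-namespace <p>,
# so nothing in an XHTML document matches it:
assert doc.findall('.//p') == []

# Bind an arbitrary prefix ('h' here) to the XHTML namespace and it works:
paras = doc.findall('.//h:p', {'h': XHTML})
```

The prefix used in the expression need not match any prefix in the document; only the namespace URI it resolves to matters.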