#whatwg on 2007-06-26

01:16	<Hixie>	well i was going to do some research of year-based data based on Last-Modified headers, but most pages don't actually serve them
01:16	<Hixie>	so that's gone out of the window
01:18	<kingryan>	Hixie: have you thought of using the wayback machine?
01:18	<kingryan>	from archive.org?
01:18	<Hixie>	how?
01:19	<kingryan>	they make their data available to researchers
01:19	<kingryan>	they have indexes from crawls by alexa that are roughly every 6 months since about 1994
01:19	<kingryan>	if you want to do comparative study based on time, you could use those buckets
01:21	<Hixie>	it's not clear to me that their system could support parsing every single file in their index
01:21	<kingryan>	I think it'd only be possible through the alexa web search apis
01:22	<Hixie>	yeah, that's not really enough for what i want to do
01:22	<Hixie>	(find how element usage varies over time)
01:22	<Hixie>	(and class, and id)
01:22	<kingryan>	yeah, you're probably right
01:25	<Hixie>	so i scanned about 100,000 documents (not really at random, so this may not be representative)
01:25	<Hixie>	about 100,000 of them had no last-modified headers
01:25	<Hixie>	about 20000 of them said 2007
01:25	<Hixie>	1 of them said 200 AD
01:26	<Hixie>	oh i see, it actually said Tue, 14 Oct 02003 06:53:14 GMT
01:26	<bewest>	heh. wise guy, eh?
01:26	<othermaciej>	served from a stone tablet?
01:26	<Hixie>	1 said 2044
01:26	<Hixie>	a number said 2099
01:27	<Hixie>	and a spanish one said Mon, 26 Jul 2250 05:00:00 GMT
01:27	<kingryan>	maybe we need to define Time5
01:27	<kingryan>	or Calendar5
01:27	<Hixie>	there's also a number of files from 1971 to 1994
01:27	<Hixie>	which is impressive since the web started in 1990
01:28	<Hixie>	but not impossible
01:28	<Hixie>	wow some of them aren't even joking
01:28	<Hixie>	http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=4803661&dopt=Abstract
01:28	<Hixie>	^ 1973
01:29	<kingryan>	I suppose that'd be an accurate LM header then
01:29	<Hixie>	looks like all the 1971-1979 dates are from nih.gov
01:30	<Hixie>	and this one from 1985 actually redirects to nih.gov heh
01:38	<Hixie>	spec says you MUST use GMT
01:38	<Hixie>	apparently some people in europe didn't understand what MUST means
01:38	<Hixie>	also what kind of date is "Mon, 22 Jan 2007 23:21:22 GMT,Tue, 07 Feb 2006 09:16:47 GMT" ??
01:39	<Hixie>	wow, all kinds of random formats are used
01:39	<Hixie>	sheesh
01:39	<Hixie>	how hard can this be
01:39	<Hixie>	"{ts '2007-04-29 03:40:38'},{ts '2007-04-29 03:40:38'}" is NOT a valid Last-Modified date!
01:39	<Hixie>	come on people!
01:50	<Hixie>	in my sample of 100000 or so files, there were about 1000 unique _formats_
01:50	<Hixie>	for the date
01:51	<kingryan>	any valid ones?
01:51	<Hixie>	there are only three valid formats per the spec, which would come up as 10 or so the way i counted it
01:51	<Hixie>	so that's about 990 invalid ones
01:52	<kingryan>	and you said "so i scanned about 100,000 documents" and "about 100,000 of them had no last-modified headers"
01:52	<kingryan>	I'm guessing one of those is off by an order of magnitude
01:52	<Hixie>	actually no
01:52	<Hixie>	it was _about_ 100,000 files, and _about_ 100,000 of them had no date
01:52	<Hixie>	both numbers to 1sf
01:52	<kingryan>	gotcha
01:53	<Hixie>	actual numbers were closer to 140000 and 100000, i think
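Editorial sketch of the kind of check Hixie describes above, assuming the "three valid formats per the spec" are the RFC 1123, RFC 850 and asctime date forms that HTTP/1.1 permits. The function name `classify_last_modified` and the regular expressions are illustrative assumptions, not the script used for the survey.

```python
import re

# Building blocks shared by the three HTTP/1.1 date formats.
_DAY3 = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
_DAY_FULL = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
_MON = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
_TIME = r"\d{2}:\d{2}:\d{2}"

# Hypothetical names for the three permitted shapes.
HTTP_DATE_FORMATS = {
    "rfc1123": re.compile(rf"{_DAY3}, \d{{2}} {_MON} \d{{4}} {_TIME} GMT"),
    "rfc850": re.compile(rf"{_DAY_FULL}, \d{{2}}-{_MON}-\d{{2}} {_TIME} GMT"),
    "asctime": re.compile(rf"{_DAY3} {_MON} [ \d]\d {_TIME} \d{{4}}"),
}


def classify_last_modified(value):
    """Return the name of the matching format, or None if the value is invalid."""
    for name, pattern in HTTP_DATE_FORMATS.items():
        # fullmatch rejects trailing junk such as a second comma-joined date.
        if pattern.fullmatch(value):
            return name
    return None
```

This only checks the shape of the value. It does not verify that the weekday agrees with the date or that the day of the month exists.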
04:33	<Hixie>	wtf is up with svn.whatwg.org
05:04	<Hixie>	http://junkyard.damowmow.com/283
05:04	<Hixie>	not very scientific
05:05	<Hixie>	but that seems to be the distribution of years in the Last-Modified headers
05:05	<Hixie>	on the web
05:41	<Lachy>	wow, I wasn't aware the google bot had access to all web pages in space and time! It'd be interesting to see what's in the pages that were last modified in 2250, just to get a glimpse of the future ;-)
05:43	<Hixie>	:-)
05:43	<Hixie>	see #whatwg for background on those numbers
05:43	<Hixie>	wait this is #whatwg
05:43	<Hixie>	aaah
05:43	<Hixie>	confusing
05:43	<Lachy>	lol
05:44	<Lachy>	should I check the logs from past or future discussion?
05:44	<Hixie>	hah
05:44	<Hixie>	last block of the logs (when i was talking to ryan)
09:05	<hsivonen>	I wonder if 1969 is actually meant to be 1970-01-01 but time zones make it fall in 1969-12-31
09:05	<hsivonen>	I would have expected to see numbers since 1992 and a peak in 1970
09:06	<hsivonen>	the data points in between and before are surprising
09:07	<othermaciej>	I think the default date on macintosh systems was 1969 at one point
09:10	<Hixie>	the numbers from 1971 to 1990 are intentional -- i spot checked some and they were of a site that made articles from those years available
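Editorial sketch of hsivonen's hypothesis above: a timestamp of zero (1970-01-01 00:00:00 UTC) lands in 1969 once it is rendered in a time zone west of UTC. The UTC-5 offset is an arbitrary assumption chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# The Unix epoch, expressed in UTC.
epoch_utc = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Hypothetical server clock five hours west of UTC.
west_of_utc = timezone(timedelta(hours=-5))
epoch_local = epoch_utc.astimezone(west_of_utc)

print(epoch_utc.isoformat())    # the epoch in UTC
print(epoch_local.isoformat())  # the same instant, on the previous calendar day
```

A server that formats an unset (zero) modification time in local time, but labels it however it likes, would then report a 1969 date.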
09:10	<annevk>	Lachy, svg:svg is not a selector
09:10	<annevk>	Lachy, it's svg|svg
09:10	<Lachy>	oops
09:10	<annevk>	Lachy, which would be a SYNTAX_ERR in IE's case
09:11	<Hixie>	i suppose i'd better actually implement all the spec changes i made recently
09:11	<annevk>	Lachy, because they don't support namespaces...
09:12	<Lachy>	they can add sufficient support for namespaces in selectors to at least understand the syntax, they just don't have a DOM with namespaces
09:14	<annevk>	they actually do... sort of
09:15	<Lachy>	yeah, they sort of do with xml data islands and stuff, but that's their mess to sort out
09:17	<hsivonen>	Lachy: their mess is generally ours to sort out
09:46	<Hixie>	http://junkyard.damowmow.com/284
09:47	<Hixie>	i wonder what all the low numbers are
09:47	<Hixie>	other than the timezone ones
09:47	<Hixie>	and what's with the hundreds of pages in the early 1900s?
09:48	<Hixie>	i wonder if a few million pages per year is enough to get decent trends data on element class and ID usage
09:49	<Hixie>	there are more pages that claim to be from 2008 than from 1991
09:49	<Hixie>	given how unlikely it is for a page to be from 2008, i wonder what that tells us about the pages that claim to be from 1991
09:51	<Hixie>	time to go home
09:51	<Hixie>	i love how there's a spike at 2038 (max 32bit time_t)
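Editorial sketch of where the 2038 spike comes from: the largest value a signed 32-bit time_t can hold is 2^31 - 1 seconds after the Unix epoch. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Largest value representable in a signed 32-bit integer.
MAX_INT32 = 2**31 - 1

epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
latest = epoch + timedelta(seconds=MAX_INT32)

print(latest.isoformat())  # the last instant a signed 32-bit time_t can represent
```

Servers that emit "the maximum possible date" as a far-future placeholder would therefore cluster in January 2038.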
09:54	<zcorpan>	http://mrclay.org/index.php/2007/06/25/kill-these-dom0-shortcuts/
09:55	<Hixie>	zcorpan: yeah, saw that. i wonder what we should do. we could deprecate those names, but it seems like a slippery slope.
09:56	<othermaciej>	you could make use of them nonconforming, but then suddenly you have conformance criteria on scripts
09:56	<zcorpan>	browsers can log overwrites in the error console
09:56	<zcorpan>	well, <form name> is already non-conforming
09:56	<othermaciej>	the special names don't always take precedence over built-in properties
09:56	<Hixie>	anyway
09:56	<zcorpan>	oh
09:56	<Hixie>	going home now
09:56	<Hixie>	later all
09:57	zcorpan	waves
09:57	<annevk>	g'night
09:57	<othermaciej>	I mean depending on the object
09:57	<othermaciej>	for HTMLFormElement they do
09:57	<othermaciej>	which is sad
09:57	<zcorpan>	<input name=submit>
09:58	<othermaciej>	for the remaining elements where name is allowed, you could make use of names that conflict with built-in DOM properties nonconforming
09:58	<zcorpan>	yeah
09:58	<othermaciej>	but then there are some things that do special lookup like this by id too
10:01	<othermaciej>	the things in WebKit that have overriding get-by-name are HTMLFormElement, HTMLFrameSetElement, HTMLObjectElement, HTMLEmbedElement, HTMLAppletElement and HTMLDocument
10:01	<othermaciej>	not sure if this is a complete list
10:01	<othermaciej>	(Window lookup by name is non-overriding I think)
10:12	<Lachy>	I don't get why Robert Burns thinks dropping <img> and <embed> in favour of a new element would work.
10:12	<Lachy>	he seems to be thinking entirely about accessibility and fallback, and ignoring every other issue like backwards compatibility
10:13	<Lachy>	and the fact that replacing <img> with <object> was already tried and mostly failed
10:15	<kfish>	Lachy, some people just like abstractions for the sake of abstraction :-)
10:15	<kfish>	whereas others prefer clarity for the sake of clarity
11:03	<Hixie>	i wonder if robin misunderstood lachy's e-mail
11:04	<annevk>	I believe he wants them to be case-sensitive
11:05	<Hixie>	in which case he misunderstood the e-mail
11:05	<annevk>	fair enough
11:22	Lachy	goes to respond to Robin to clarify it for him
11:33	<Jero>	"...but no start tag token has ever been emitted by this instance of the tokeniser (fragment case)..." This simply means the stack is empty, right?
11:34	<annevk>	if that's what it means it would be better if the spec said that...
11:35	<Jero>	well I'm not sure if that's what it means, but it basically seems like it does
11:36	<Jero>	should I send Hixie an email?
11:38	<annevk>	why not
11:38	<annevk>	I suppose it might be a while to get an answer so I'd just go ahead with something and test it
11:38	<annevk>	maybe compare with html5lib
11:40	<Jero>	is it completely up to date?
11:40	<annevk>	was this a recent change?
11:40	<annevk>	I'm not sure if it's up to date with fragment parsing per se
11:41	<Jero>	I'm working on revisions 908 till 960
11:41	<Jero>	though I'm not sure in which revision this change was made
11:42	<annevk>	I made most of those, didn't see fragment cases though
11:43	<Jero>	hmm ok
11:43	<Jero>	I'll just send Hixie an email and interpret it as "...if the stack of open elements is empty..." for now
11:44	<annevk>	alternatively you could test browsers
11:44	<Jero>	hmm yeah
11:44	<Jero>	i'll do a couple of tests
11:57	<Jero>	actually it's quite logical, if no start tag has been emitted, then there's no reason to check if the closing tag is the closing tag for the element that triggered the (R)CDATA state
11:57	<Jero>	thus checking if no start tag has been emitted is practically the same as checking if the stack is empty
12:37	<annevk>	hsivonen++
12:46	<zcorpan>	would i send email to xml-names-issues⊙wo for bugs in the namespaces in xml 1.0 spec?
12:57	<annevk>	would or should?
12:58	<hsivonen>	zcorpan: what bug?
12:59	<hsivonen>	I think I've finally gotten the byte stream decoding right
12:59	<hsivonen>	whew. that was hard
13:00	<hsivonen>	and I only solved the cases that are needed for my tokenizer. a general-purpose InputStreamReader substitute would be even harder
13:05	Lachy	throws a tomato at hsivonen :-P
13:06	<annevk>	hsivonen, to properly handle unicode?
13:10	<Jero>	Why can the "DOCTYPE public/system identifier (single/double-quoted) state" not be combined with the "Before DOCTYPE public/system identifier state"?
13:10	<Jero>	the QUOTATION MARK case could simply say "get all characters until the next QUOTATION MARK or EOF character"
13:11	<Jero>	same for the APOSTROPHE case
13:11	<annevk>	because that's not the way the rest of the states work (such as attribute values)
13:12	<annevk>	you could implement it that way though
13:12	<Jero>	right, but the same would also apply to the attribute values then, right?
13:12	<annevk>	well, attribute values special case & too for obvious reasons
13:13	<Jero>	oh yeah, that's right
13:13	<hsivonen>	annevk: to properly decode a byte stream into char[] while using a decoder API that I didn't design, recovering from error, reporting errors at the same time and keeping track of the 512 byte boundary
13:14	<annevk>	is char[] unicode aware in Java?
13:14	<annevk>	yeah, it is iirc...
13:14	<hsivonen>	annevk: char[] is an array of UTF-16 code units
13:15	<Jero>	annevk: but then again, why should the DOCTYPE states not be changed because the other states don't work that way? It's not like they conflict with each other
13:15	<annevk>	so not necessarily 16 bits, right?
13:15	<hsivonen>	annevk: char[] is an array of unsigned 16-bit values
13:15	<the_mart>	It only supports the BMP though.
13:15	<annevk>	Jero, I like the current way better... It's just a way of writing things down. not worth debating too much about I think
13:16	<Jero>	true
13:16	<hsivonen>	the_mart: what supports only the BMP?
13:16	<the_mart>	char in Java.
13:16	<annevk>	hsivonen, so what about code units that require more than 16 bits? I believe they exist...
13:16	<hsivonen>	the_mart: char yes, but char[] supports astral planes if you use it right
13:17	<annevk>	BMP?
13:17	<hsivonen>	annevk: UTF-16 code units are always 16 bits. code points that don't fit in 16 bits are handled as two code units
13:17	<hsivonen>	annevk: Basic Multilingual Plane
13:18	<annevk>	ah, code points
13:18	<annevk>	that makes sense
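Editorial sketch of the code unit versus code point distinction hsivonen draws above, written in Python rather than Java. The helper names `utf16_code_units` and `to_surrogate_pair` are illustrative assumptions. A code point outside the Basic Multilingual Plane is stored as two 16-bit code units, a high and a low surrogate.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of unsigned 16-bit ints."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(f">{len(data) // 2}H", data))


def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# U+0041 fits in one code unit; U+1D11E (an astral-plane character) needs two.
print([hex(u) for u in utf16_code_units("A")])
print([hex(u) for u in utf16_code_units("\U0001D11E")])
```

Every element of the returned list fits in 16 bits, which is the property a Java char[] has as well.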
13:18	<the_mart>	It uses surrogate pairs, but Java doesn’t have native support for them.
13:19	<hsivonen>	the_mart: java.nio.charset uses surrogate pairs natively
13:19	<hsivonen>	the_mart: my code is fully astral-aware
13:19	<the_mart>	Really?
13:19	<hsivonen>	the_mart: yes
13:19	<the_mart>	I’ll have to look at that.
13:19	<hsivonen>	the_mart: Sun even has done the right thing for java.io classes
13:20	<hsivonen>	the_mart: the implementation is hairy when you read one char at a time and the decoder needs to look ahead
13:20	<hsivonen>	the_mart: that's why I said I only covered the cases that my tokenizer needs
13:20	<the_mart>	Can it convert them to UTF-8 properly?
13:20	<hsivonen>	the_mart: yes. with error detection and everything
13:20	<the_mart>	Wow.
13:21	<the_mart>	I’m not really a Java person myself. :)
13:21	<hsivonen>	The JDK together with ICU4J is one of the best Unicode wrangling platforms around if you know what you are doing. (I do. :-)
13:22	<hsivonen>	far from perfect but other platforms suck more
13:22	<the_mart>	I prefer to program in C#.
13:23	hsivonen	prefers Sun shackles over Microsoft shackles
13:23	<the_mart>	:)
13:23	<zcorpan>	hsivonen: it's unclear whether two attributes with same local name and namespace is a fatal error or not
13:23	<the_mart>	Well it is standardised by ECMA.
13:24	<hsivonen>	zcorpan: interesting. I've never considered that case
13:24	<hsivonen>	the_mart: I don't value standards org labels that much
13:25	<zcorpan>	hsivonen: firefox/safari abort parsing. ie/opera don't. the spec says it's illegal but doesn't explicitly say that it's a namespace constraint
13:26	<Lachy>	has anyone made an issue page for longdesc on the wiki yet? I can't find one mentioned anywhere
13:26	<hsivonen>	zcorpan: have you tested Xerces2-J?
13:26	<zcorpan>	hsivonen: no
13:26	<zcorpan>	http://simon.html5.org/test/xml/ns-malformed/001.xml
13:27	<the_mart>	Does IE actually support namespaces in XML though?
13:27	<hsivonen>	zcorpan: hmm. Ælfred2 does not detect an error
13:28	<zcorpan>	the_mart: yes
13:30	<hsivonen>	I guess I'm on the hook for fixing that if the XML folks decide it is a reportable error
13:31	<hsivonen>	that being Ælfred2 behavior
13:31	<annevk>	hmm, Opera fails too
13:32	<zcorpan>	i just wonder where i should report it. xml-names-issues isn't open anymore
13:33	<the_mart>	Isn’t it covered in section 6.3 of Namespaces in XML?
13:35	<zcorpan>	"The confusion comes from document conformace section that says regrading namespace-well-formedness that 'element and attribute names MUST match the production for QName and MUST satisfy the "Namespace Constraints". All other tokens in the document which are REQUIRED, for XML 1.0 well-formedness, to match the XML production for Name MUST match this specification's production for NCName'. Duplicate attributes issue is not explicitly mark
13:35	<zcorpan>	"namespace constraint" however."
13:36	<annevk>	I'd e-mail xml-editor
13:36	<zcorpan>	ok
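Editorial sketch of the case zcorpan raises: two attributes whose qualified names differ but whose namespace name and local name are the same. The function `find_duplicate_attributes`, its inputs, and the example URI are hypothetical. It only shows how such a collision can be detected and takes no position on whether the error is fatal.

```python
def find_duplicate_attributes(qnames, prefix_to_uri):
    """Return expanded names (namespace URI, local name) that occur more than once.

    qnames is a list of attribute names as written, e.g. "a:x" or "x".
    prefix_to_uri maps each in-scope prefix to its namespace URI.
    """
    seen = set()
    duplicates = []
    for qname in qnames:
        prefix, sep, local = qname.partition(":")
        if sep:
            expanded = (prefix_to_uri[prefix], local)
        else:
            # Unprefixed attributes are in no namespace.
            expanded = (None, qname)
        if expanded in seen:
            duplicates.append(expanded)
        seen.add(expanded)
    return duplicates


# Two prefixes bound to the same URI make a:x and b:x the same attribute.
bindings = {"a": "http://example.org/ns", "b": "http://example.org/ns"}
print(find_duplicate_attributes(["a:x", "b:x", "x"], bindings))
```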
13:56	<zcorpan>	"deprecated" is such a misunderstood term
13:57	<zcorpan>	people say that target="" is deprecated in html4 strict. but it really is forbidden in html4 strict but deprecated in html4 transitional
13:59	<annevk>	removing <img> is so not going to fly
14:18	<zcorpan>	http://forums.whatwg.org/viewtopic.php?t=69
14:19	<the_mart>	At least they don’t say that it’s “depreciated”. ;)
14:25	<annevk>	zcorpan, yeah, I noticed that error too, haven't reported it yet though...
14:25	<zcorpan>	i can forward the forum post to the list
14:30	<annevk>	sure
15:06	<zcorpan>	http://lists.whatwg.org/pipermail/whatwg-whatwg.org/2007-June/012022.html er, i must have screwed something up there
15:07	<zcorpan>	i positively had blank lines around the inner quote when i wrote it
15:35	<met_>	'Much of XHTML 2 works already in existing browsers' ( http://www.w3.org/TR/xhtml2/introduction.html#backCompat )
15:36	<annevk>	if you use client side XSLT... sure
15:38	<zcorpan>	much of FooML works already in existing browsers, too
16:02	<annevk>	Lachy, he means exceptions inside NSResolver
16:03	<annevk>	Lachy, which are raised while the UA executes it (I think, anyway)
16:04	<Lachy>	oh, I didn't realise. I just assumed he meant the exceptions that are actually defined in the spec
16:06	<Lachy>	it should go to the caller anyway, at least in ecmascript, but it would really depend solely on how the programming language handles exceptions
16:17	<Philip`>	Could lookupNamespaceURI be called again after an exception has been thrown, before selectElement has returned? (Maybe some implementation with a non-interruptible selector system would just set an 'exception' flag when an exception is thrown, but then carry on as normal, before finishing and then rethrowing the exception out of selectElement, or something...)
16:18	<annevk>	yeah, it should probably say whether exceptions are ignored or re-raised
16:19	Philip`	wonders what would happen if you made lookupNamespaceURI call selectElement recursively
16:19	<Philip`>	(I guess JS implementations have a recursion limit, but does that apply to JS calling native code calling JS calling native code ...?)
16:21	<Philip`>	(I can't actually think of any existing cases where JS callbacks are run synchronously, but probably just because I'm unfamiliar with the area)
16:32	<Lachy>	any suggestions for wording to put in the spec?
16:35	Philip`	doesn't really know anything about it :-)
16:42	<Lachy>	I suppose it would work like a callback function, like in Array.forEach(callback)
16:52	<Lachy>	Does this sound ok? "If an exception is raised by the NSResolver while resolving namespaces, processing must be aborted and the exception passed back to the caller."
17:01	<Philip`>	That seems to make sense to me
17:01	<Lachy>	I haven't checked it in yet. I sent it to the list to see if someone has any better suggestions, since the issue is not entirely clear to me either
17:02	<Philip`>	though the word "passed" doesn't seem to fit perfectly for exceptions, since that makes them sound more like return values, but I can't think of anything better
17:02	<Lachy>	perhaps "propogated" instead
17:02	<Philip`>	(Also, I guess it should say "NSResolver (or ECMAScript Function)" like I vaguely remember it saying elsewhere)
17:02	<Lachy>	propagated, even
17:03	<Philip`>	That sounds reasonable
17:03	<Lachy>	not necessary, since I've already defined that the ECMAScript Function is just a special language binding for the NSResolver
17:04	<Philip`>	Ah, okay
18:38	Lachy_	wonders why the video codec thread is continuing. I thought the solution was already explained.
18:40	<Lachy>	As long as third parties are able to provide browser plugins and codecs that work with <video>, UAs don't need native support for every format built in. Firefox, for example, should be able to invoke QuickTime for MP4 content, as long as QuickTime provides an appropriate API for FF to work with.
18:41	<Lachy>	or even VLC
18:49	<the_mart>	Yeah, and it’s a bit harsh how some people keep having a go at Apple over it.
18:55	<tndH>	I suspect some people will still be arguing after the patents have expired.
20:05	<maikmerten>	Lachy, well, the problem is: VLC isn't really legal in many countries and QuickTime isn't installed on many systems. I do think it may make sense if browsers try to invoke external media frameworks if they can't handle content themselves, though.
20:05	<maikmerten>	however, they still should ship with at least one set of codecs content providers can rely on
20:06	<maikmerten>	in worst case that'd mean the market would be split between WMV, MP4 and Ogg.
20:07	<maikmerten>	but that means you "only" need to encode 3 versions to serve like 99% of potential customers ;)
20:07	<Lachy>	They don't have to ship with VLC in the browser.
20:08	<maikmerten>	right
20:08	<maikmerten>	well, anyway, at least on Windows the more generic choice would be DirectShow
20:08	<maikmerten>	on Mac it would be QuickTime
20:08	<maikmerten>	and on Linux perhaps GStreamer
20:09	<Lachy>	any third party should be able to write and distribute a plugin that will work with the browser, and if VLC does that from their site, no-one can stop any user from downloading it
20:09	<maikmerten>	that should suffice, combined with one natively supported codec
20:10	<maikmerten>	well, as a matter of fact VLC does have a browser plugin already
20:10	<Lachy>	anyway, I should get some sleep. good night
20:10	<maikmerten>	night
21:22	<Hixie>	Jero?
21:22	<Jero>	yes?
21:22	<Hixie>	so that thing you were asking about
21:22	<Jero>	the "...but no start tag token has ever been emitted by this instance of the tokeniser (fragment case)..." thing?
21:23	<Hixie>	yeah
21:23	<Hixie>	let me find it, hold on
21:23	<Jero>	sure
21:23	<Hixie>	ah, i see
21:23	<Hixie>	it doesn't mean "is the stack empty", because the stack is basically never empty (at least not in the fragment case)
21:24	<Hixie>	nor does it mean "is there only one thing in the stack"
21:24	<Hixie>	e.g. it wouldn't fire for the second "</" in <html><head></head></head></html>
21:24	<Hixie>	it literally means that no start tag token has ever been emitted
21:25	<Hixie>	e.g. because you're doing the innerHTML of a <style> element
21:25	<Jero>	oh i see, so in the fragment case, you really need to keep track of the number of start tags processed?
21:34	<Hixie>	Jero: somehow or other, yeah
21:34	<Jero>	ok, thanks for your response
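Editorial sketch of one way to track what Hixie describes: a flag recording whether this tokeniser instance has ever emitted a start tag token, kept separately from the stack of open elements. The class `TokenizerState` and its methods are hypothetical and heavily simplified. A flag suffices here, so no count is kept.

```python
class TokenizerState:
    """Minimal bookkeeping for the 'no start tag ever emitted' condition."""

    def __init__(self):
        self.start_tag_emitted = False
        self.open_elements = []  # stands in for the tree builder's stack

    def emit_start_tag(self, name):
        self.start_tag_emitted = True
        self.open_elements.append(name)

    def emit_end_tag(self, name):
        # Popping the stack does not reset the flag.
        if self.open_elements and self.open_elements[-1] == name:
            self.open_elements.pop()

    def no_start_tag_ever_emitted(self):
        return not self.start_tag_emitted


state = TokenizerState()
print(state.no_start_tag_ever_emitted())  # fresh instance, e.g. innerHTML of <style>
state.emit_start_tag("head")
state.emit_end_tag("head")
print(state.no_start_tag_ever_emitted(), state.open_elements)
```

After the start and end tag, the simplified stack is empty again but the flag stays set, which is why the two conditions are not interchangeable.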
23:27	Hixie	wonders if someone is going to point out to Sebastian
23:27	<Hixie>	that XHTML 1 and XHTML 2 have the same problem
23:27	<Hixie>	and that in fact XHTML2 and HTML have the same problem
23:28	<Hixie>	and that XHTML5 and HTML5 are good matches for precisely the reason he gave...
23:30	nickshanks	winders what ian is going on about
23:30	<nickshanks>	*wonders even
23:30	<Philip`>	http://lists.w3.org/Archives/Public/public-html/2007Jun/0866.html
23:31	<Hixie>	yeah
23:36	<nickshanks>	yeah, he seems to have slipped up there
23:36	<nickshanks>	but his surname makes up for that
23:37	<zcorpan>	i also don't see how he can know that using the name xhtml5 will result in more confusion than a different name (that he didn't propose)
23:38	<zcorpan>	e.g., if we call the xml serialization of html5 "bob", will there be less confusion than if we called it "xhtml5"?
23:39	<nickshanks>	Would IE 8's implementation be called Microsoft Bob the?
23:39	<nickshanks>	then
23:42	zcorpan	will create an xml serialization of html3.2. and name it xhtml3.2
23:47	<nickshanks>	HTML 3.0 had some nice things in it
23:47	<nickshanks>	so don't neglect that one too :)
23:47	<Hixie>	any other than maths and <credit> that we haven't taken yet?
23:47	<zcorpan>	<note>
23:47	<Hixie>	<aside>
23:48	<nickshanks>	i still want an <image> element that takes fallback content
23:50	<nickshanks>	Hixie: do you recall how many webpages used <image> as an empty element (i.e. they meant <img>)?
23:50	<zcorpan>	nickshanks: how would you check that?
23:51	<nickshanks>	by looking for </image>
23:51	<nickshanks>	oh, never mind, the google web survey only counted opening tags
23:53	<Hixie>	use <object>
23:53	<Hixie>	we can't change <image> handling.
23:55	<othermaciej>	<image> is one of those things that makes you doubt people's reading comprehension
23:57	<zcorpan>	</p style=border:solid> -- opera and safari render a border, ie and firefox don't
23:58	<nickshanks>	hahaha