#whatwg on 2007-05-13

08:20	<mikeday>	is whatwg.org down or is it just me?
08:26	<Lachy>	it appears to be down
08:27	<mikeday>	is the HTML5 spec anywhere else, like w3.org?
08:27	<Lachy>	yes, in CVS
08:27	<Lachy>	dev.w3.org
08:27	<Lachy>	http://dev.w3.org/cvsweb/html5/
08:28	<mikeday>	awesome :)
09:45	<mikeday>	Hmm, the HTML5 spec seems to say that comments cannot occur before the root element
09:46	<zcorpan_>	mikeday: where do you read that?
09:46	<mikeday>	tree construction, 8.2.4.1. The initial phase
09:47	<zcorpan_>	that's before the doctype, no?
09:47	<hsivonen>	hmm. looks like the entire dreamhost is down
09:47	<hsivonen>	can't get to damowmow portal or the DOM viewer to check this
09:48	<mikeday>	ah, so only before the doctype
09:48	<hsivonen>	dreamhost has been down a bit too often lately
09:48	<zcorpan_>	mikeday: yeah... but then the #writing section goes ahead and says that comments are allowed before the doctype
09:48	<mikeday>	hrmph, that's helpful :)
09:49	zcorpan_	pointed that out before
09:50	<mikeday>	U+00 is converted to U+FFFD, but what about other weird characters like U+07?
09:52	<hsivonen>	mikeday: other weird stuff is preserved
09:52	hsivonen	has complained about that before
09:52	mikeday	is noticing a pattern here
09:53	<mikeday>	okay, one more thing: what does RCDATA stand for?
09:54	<zcorpan_>	replaced character data
09:55	<mikeday>	what exactly is replaced about it?
09:55	<zcorpan_>	entities
09:55	<mikeday>	can have entities... ah.
10:24	<annevk>	doesn't really matter what it stands for...
10:24	<annevk>	just implement the steps
10:49	<annevk>	oh, whatwg is down?
10:50	<annevk>	is the mail server down too?
10:50	annevk	wonders how that works
10:53	<annevk>	it seems that lists.whatwg.org is not down
10:53	<annevk>	on the other hand, my e-mail hasn't made it through to the archives yet...
11:18	<mikeday>	hi annevk
11:19	<mikeday>	took a look at the html5lib code, looks rather clean
11:20	<mikeday>	just toying with some C code
11:20	<mikeday>	it's a shame that you've got to do so much irrelevant stuff in C, though.
11:23	<annevk>	python is nice
11:23	<annevk>	especially to "quickly" prototype stuff like this
11:23	<annevk>	the problem is that it doesn't scale well for very large pages, such as the HTML5 spec
11:24	<mikeday>	you could probably speed it up, at the risk of making the code much uglier...
11:26	<annevk>	yeah... rather have a fast C implementation with Python wrappers I think
11:28	<mikeday>	that's the spirit, outsource the ugliness somewhere else :)
11:30	mikeday	ponders
11:30	<mikeday>	the data state can have a very tight inner loop, just scanning for the next & or <
11:31	<annevk>	or EOF
11:31	<annevk>	charsUntil() handles EOF automatically
11:31	<annevk>	so you know
11:32	<mikeday>	I'm assuming you're working on a chunk of data, so you know there is no EOF in the middle of the chunk
11:33	<annevk>	if you do script execution document.close() might do that
11:33	annevk	isn't sure
11:33	<annevk>	but it depends on how you implement stuff, I guess
11:33	<mikeday>	right
11:34	<mikeday>	I wonder which is faster: if '&' else if '<' else ..., or a table lookup
11:34	<mikeday>	eg. if charTable[currChar] == MARKUP_CHAR
11:35	<annevk>	from the little I know I believe table lookup is faster
11:35	<annevk>	however, how would you handle "any other character" in that case?
11:35	<annevk>	(I don't think I'm the right person to discuss this with though.)
11:36	<mikeday>	any other character would be the else case
11:37	<annevk>	that would work nicely then I suppose
11:37	<mikeday>	if (... == MARKUP_CHAR) { change state } else { keep accumulating character data }
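
A minimal C sketch of the two data-state inner loops being compared here, assuming the tokenizer works on a byte chunk with a known length (so there is no EOF inside the chunk, as mikeday assumes). The names `char_table`, `init_char_table`, `scan_branch` and `scan_table` are hypothetical; only `MARKUP_CHAR` comes from the discussion, and this is not html5lib's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OTHER_CHAR = 0, MARKUP_CHAR = 1 };

/* 256-entry classification table, one byte per possible input byte. */
static unsigned char char_table[256];

/* Mark '&' and '<' as markup; every other byte stays OTHER_CHAR. */
static void init_char_table(void) {
    memset(char_table, OTHER_CHAR, sizeof char_table);
    char_table[(unsigned char)'&'] = MARKUP_CHAR;
    char_table[(unsigned char)'<'] = MARKUP_CHAR;
}

/* if/else variant: returns the index of the first '&' or '<', or len if none. */
static size_t scan_branch(const char *buf, size_t len) {
    assert(buf != NULL);
    size_t i = 0;
    while (i < len && buf[i] != '&' && buf[i] != '<') {
        i++;
    }
    return i;
}

/* Table variant: same result, one lookup per byte instead of two compares. */
static size_t scan_table(const char *buf, size_t len) {
    assert(buf != NULL);
    size_t i = 0;
    while (i < len && char_table[(unsigned char)buf[i]] != MARKUP_CHAR) {
        i++;
    }
    return i;
}
```

Both functions return how many bytes of character data can be accumulated before the state has to change; `init_char_table()` must run once before `scan_table()` is used. Which variant is faster is something to measure on real input rather than assume.
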
11:37	<mikeday>	always frustrates me that efficient code looks less and less like the specification, though
11:37	<mikeday>	we still don't have a magical compiler that converts spec -> code
11:38	<annevk>	just use the tests from html5lib
11:38	<annevk>	and maybe contribute some more
11:38	<annevk>	and pay some attention to the spec too :)
11:39	<mikeday>	right :)
11:42	<mikeday>	hmm, using the HTML5 spec as a test document is rather meta
11:42	<mikeday>	especially considering it's not very well-formed :/
11:43	<annevk>	the multipage version of HTML5 is generated using html5lib
11:43	<annevk>	that's meta
11:44	<mikeday>	neat :)
11:47	<hsivonen>	mikeday: do you use a DFA for XML?
11:48	<mikeday>	hsivonen, not yet, but I'd like to
11:48	<mikeday>	I've generated one, but haven't got around to building a parser around it yet.
11:48	<hsivonen>	mikeday: surely a function call per tokenizer state is good enough considering that it is the de facto way to write XML parsers
11:49	mikeday	shrugs
11:49	<mikeday>	for HTML5 you mean?
11:49	<hsivonen>	I intend to optimize away the explicit state variable but I hesitate going all the way to a hand-rolled DFA
11:50	<hsivonen>	mikeday: I meant a function call (possibly inlined by compiler) per state in the HTML5 tokenizer spec
11:50	<hsivonen>	mikeday: the XML parsers that I've looked at work roughly that way
11:51	<mikeday>	after looking at the spec, I've seen that the state machine is rather more complicated than the average DFA
11:51	<mikeday>	with XML it's easier, as you're going from grammar to DFA
11:52	<annevk>	there are some additional switches indeed based on tree construction feedback
11:52	<annevk>	although I think you should be able to integrate those too
11:52	<mikeday>	right, it would take a bit of messing around though
11:52	<annevk>	(it leads you further away from the spec though)
11:52	<mikeday>	that too.
11:52	<annevk>	shouldn't be much of an issue I think...
11:53	<mikeday>	by the way, a tiny test seems to show that the if/else is slightly faster than table
11:53	<mikeday>	if only two characters are being checked for
11:53	<annevk>	see, don't trust me :)
11:53	<mikeday>	but if three or more characters are being checked for, table wins by far
11:53	<annevk>	oh, ok :)
11:53	<mikeday>	eg. for whitespace characters it would be a win
11:54	<mikeday>	for the data state inner loop, not so much
11:54	<hsivonen>	I wonder if it is possible to construct a hash function that hashes all UTF-16 code units to a small range of integers so that markup-significant characters get unique scalars and neutral characters overlap
11:55	mikeday	grins
11:55	<hsivonen>	(and effient one, that is)
11:55	<hsivonen>	efficient even
11:55	<mikeday>	let's see, markup significant characters are all < U+007F
11:56	<mikeday>	just make sure that everything above 127 is mapped to 127..255 range
11:57	<mikeday>	and ASCII stays as it is
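
One way to read the mapping mikeday describes, as a C sketch: ASCII code units index the table directly, and everything else is folded into the upper half of a 256-entry table. The function name `table_index` is hypothetical, and folding into 128..255 (rather than the 127..255 mentioned above) is an assumption made so that no non-ASCII unit can collide with an ASCII one.

```c
#include <assert.h>
#include <stdint.h>

/* Map a UTF-16 code unit to a table index in 0..255.
   ASCII (0..127) maps to itself, so '&' and '<' keep distinct indices.
   Any other code unit is folded into 128..255, where collisions between
   "neutral" characters are harmless. */
static unsigned table_index(uint16_t code_unit) {
    unsigned idx;
    if (code_unit < 0x80) {
        idx = code_unit;
    } else {
        idx = 0x80u | (code_unit & 0x7Fu);
    }
    assert(idx <= 0xFFu);
    return idx;
}
```
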
11:57	<mikeday>	or do you want & and < to map to the same small integer?
11:57	<hsivonen>	didn't think that far
11:57	<hsivonen>	gotta go. later
11:57	mikeday	waves
11:58	<mikeday>	hrm, jumping into the micro-optimisation, I forgot that no one uses UTF-16 anyway
11:58	<mikeday>	(for given values of no one)
12:02	<annevk>	in some states unicode chars are important
12:02	<mikeday>	?
12:02	<annevk>	tag name state
12:02	<annevk>	but I suppose that doesn't matter much
12:03	<annevk>	that's actually in the anything else case so...
12:03	<annevk>	nm me
12:03	<mikeday>	I noticed that the tag names all get lowercased
12:03	<mikeday>	that would mean that <camelCase> XML tags can't be embedded in HTML5, right?
12:06	<annevk>	ASCII lowercase, yes
12:06	<annevk>	XML can't be embedded in HTML5
12:06	<mikeday>	true, you could have camelCase tags as long as they use accented letters :)
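
A small C sketch of what "ASCII lowercase" means for tag names, assuming the name is held as UTF-8 bytes in a mutable buffer. The helper name `ascii_lowercase` is hypothetical. Only `A` to `Z` are folded, which is why accented capitals survive.

```c
#include <assert.h>
#include <stddef.h>

/* Lowercase only the ASCII letters 'A'..'Z' in place.
   All other bytes, including the UTF-8 bytes of accented letters,
   are left untouched. */
static void ascii_lowercase(char *s, size_t len) {
    assert(s != NULL);
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 'A' && s[i] <= 'Z') {
            s[i] = (char)(s[i] + ('a' - 'A'));
        }
    }
}
```
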
12:07	<mikeday>	are unknown tags still added to the DOM?
12:08	<annevk>	of course
12:08	<annevk>	there's in fact no difference between "unknown tags" and <span> for instance
12:08	<annevk>	(iirc)
12:09	<mikeday>	so arbitrary vocabularies can be included,
12:09	<mikeday>	as long as they don't require <camelCase>
12:09	<mikeday>	or plain uppercase, for that matter
12:09	<mikeday>	seems like MathML would work fine
12:10	<annevk>	there's no namespace support either
12:11	<annevk>	but in due course we would add limited support for that I suppose
12:12	<Philip`>	annevk: Doesn't http://dev.w3.org/cvsweb/~checkout~/html5/spec/Overview.html?rev=1.12&content-type=text/html;%20charset=iso-8859-1#pixel cover the points about how an arbitrary object is treated as ImageData?
12:13	<annevk>	oh, I think I've been looking at an old version of the spec
12:14	<Philip`>	mikeday: I'd expect table lookups to usually be much slower than if/elses in real programs because you won't be able to keep the lookup table in the cache for very long (if you're processing lots of other data at the same time) and it'll have to do really expensive memory reads
12:15	<Philip`>	People used to use lookup tables for fast sin/cos calculations, but now it's much quicker just to get the CPU to recalculate it every time because memory is slow
12:15	<mikeday>	Philip`, the table is pretty small, 256 bytes, but the processing other data at the same time constraint could be a problem
12:16	<Philip`>	Caches are pretty small too :-)
12:16	<annevk>	thanks Philip`
12:16	<Philip`>	(like, uh, 16KB or something?)
12:16	<Philip`>	(depending on what processor you have)
12:17	<mikeday>	the whitespace test requires five else if branches, though
12:18	<mikeday>	at least it wouldn't be hard to try both methods on real world data
12:18	<mikeday>	as it's not really fundamental to the structure of the code
12:25	<met_>	annevk, why does http://annevankesteren.nl/2006/08-paintr21 say "It works in Firefox (given a few hacks), with the notable exception of the "Save it!" button."? Save works for me in FF 2.0.0.3
12:26	<met_>	the only difference is nice Paintr logo in Opera vs. text logo in FF
12:28	<met_>	ah see the logo is made by css content:url
12:30	<annevk>	that thing was made before FF2
12:30	<met_>	can you update the text? 8-)))
13:11	Philip`	wonders if <div irrelevant><img ...><img ...></div> would be a sensible way of pre-loading images to be used in a canvas, so you can just wait for window.onload and then be sure all the images are loaded
13:12	<annevk>	I think if you do img.src in a script the load event is delayed as well
13:14	<Philip`>	Oh, that sounds better
13:21	<Dashiva>	What's the deal with r\^ole?
13:22	<Philip`>	It's the (La)TeX spelling, I believe
13:23	<Dashiva>	of rôle?
13:24	<Philip`>	Maybe, but my IRC client mangles that
13:24	Philip`	looks in the log
13:24	<Philip`>	Ah, yes, that
13:25	<Philip`>	Same as rôle too, but not quite so ugly
13:26	<Dashiva>	But what's wrong with just role, was more my question
13:26	<Lachy>	aargh! I've asked 3 times for Patrick (or anyone else) to provide examples of tables that would benefit from the headers attribute, and each time he's bypassed the question entirely
13:26	<annevk>	lol, people are wasting their time on www-html? :)
13:26	<Lachy>	it's so annoying that they won't contribute when asked, and then bitch about being ignored
13:27	<annevk>	they are indeed
13:27	<Philip`>	Oh - just spelling it "role" seems far more sensible :-)
13:27	<annevk>	fun
13:27	<Dashiva>	Isn't that what the semantic web is all about?
13:27	<Dashiva>	Getting other people to do all the work, and then complaining about nothing happening
13:32	<Philip`>	That sounds like the approach of getting authors to mark up all their data correctly in a machine-processable form, so you can build advanced search engines on the semantic web that correctly understand the relationships between pieces of data
13:33	<Philip`>	compared to e.g. Google, which just puts up with whatever rubbish authors create
13:33	<Philip`>	but it's kind of obvious which one is doing better at the moment
13:40	<maikmerten>	wow, seems Opera's layout engine is 1345% more green than other competing engines... impressive http://en.wikipedia.org/wiki/Comparison_of_layout_engines_(WHATWG)
13:41	<maikmerten>	one keeps wondering why such things make it into Wikipedia
13:41	<Dashiva>	Probably because all browsers have their share of fanatical fanboys
13:42	<annevk>	prolly also because it doesn't list all the WHATWG features
13:45	<Philip`>	You could replace the whole first table with "Web Forms 2.0: No ? Yes" and then Opera wouldn't be seen as having such an unfair lead
13:46	<Dashiva>	Thinking of it as a lead is a problem to begin with, IMO
13:47	<annevk>	Safari for instance does support type=range iirc
13:47	<annevk>	Firefox supports persistent storage
13:47	<Philip`>	Also one could change <video> to no in Opera, because it's not fair to count very experimental builds that don't even match the WA1 spec
13:47	<annevk>	Internet Explorer supports parts of drag & drop, draggable, contenteditable, etc.
13:49	Philip`	wonders if anyone has made a <canvas> paint program that can save and load from globalStorage
13:49	<Philip`>	Oh, actually, that wouldn't work because you can't draw data: images then call toDataURL again :-(
13:53	<annevk>	Maybe the new definition of origin helps with that?
13:54	<annevk>	Cause in theory that would be a safe image, unless you got it after a redirect
13:54	<Philip`>	"The origin of a Document or image that was generated from a data: URI found in another Document or in a script is the origin of the that Document or script." - oh, sounds like that covers it
13:55	<annevk>	Although if you store it in globalStorage and then retrieve it later...
13:55	annevk	ponders
13:56	<Philip`>	You'd just get a string out of globalStorage, and I assume strings don't have complex security arrangements
13:56	<Philip`>	and then you'd create an image from that string, but that image would be created in your own document
13:57	<annevk>	sounds tricky
13:58	<Philip`>	(If you've got the data: string, you could rewrite libpng in JS and get the image data anyway, so the only problem is in whether you're allowed to get the string in the first place)
13:58	<Philip`>	(and you should be allowed to get strings from globalStorage, because otherwise it'd be a bit pointless...)
13:58	<Philip`>	but I don't know if that agrees with what the spec says
13:59	<annevk>	I suppose data: URLs not retrieved from <img> objects or non-same origin <canvas> objects are to be considered safe
13:59	<annevk>	and that therefore invoking toDataURL() should not fail and drawImage() should not mark the <canvas> object non-same origin
14:03	<annevk>	I suppose the problem is that painting a data URL might not always be safe
16:40	<annevk>	http://weblog.200ok.com.au/2007/05/what-i-want-from-new-markup-spec.html
16:44	<Lachy>	hmm. Looks like we need some kind of tutorial to explain how the heading structure works
16:45	<annevk>	http://www.kavoir.com/2007/05/html5-adopted-by-w3c.html is someone who thinks Chris Wilson will be editor
16:48	<Philip`>	Also thinks Microsoft is one of the key contributing groups in the WHAT-WG
16:48	<annevk>	http://ma.gnolia.com/people/apartness/bookmarks/prejesh
16:50	<annevk>	http://www.designerstalk.com/forums/web-standards/26075-web-standards-danger.html
16:51	<annevk>	http://www.elementary-group-standards.com/web-standards/web-standards-html5-support-existing-content.html
16:59	met_	is glad he is writing in Czech only, so all his mistakes cannot be discussed here 8-)
17:00	<annevk>	I wonder why people on www-html think there was some arbitrary decision process going on... The sole reason <samp> and such are still here is because dropping them would cost more.
17:01	<annevk>	I think there have hardly been any arbitrary decisions with regards to HTML5
17:06	<wilhelm>	Why would one want to drop such elements?
17:09	<csarven>	annevk tsk tsk <m>
17:11	<Lachy>	annevk, I think he's just using code, samp, etc. to make a point about dropping things like headers="" and summary=""
17:12	<Lachy>	personally, I somewhat agree with keeping headers (I'm just trying to get them to help find evidence for it), though I'm undecided about summary
17:14	<Philip`>	http://canvex.lazyilluminati.com/misc/summary.html is how people seem to be using summary now
17:15	<Philip`>	((Can't remember if I pointed that out here before))
17:17	<Lachy>	Philip`, what was the total sample size surveyed?
17:19	<Lachy>	wow, so many of them are used for presentational purposes
17:22	<Philip`>	That was 2523 pages, of which 105 had a summary attribute anywhere
17:23	<Lachy>	I think we need a larger sample size
17:23	<Philip`>	The results are probably misleading because a few sites have a lot of distinct summaries
17:24	<Lachy>	the results should be grouped by domain name to deal with that
17:25	<Philip`>	It also seems quite hard to analyse the results automatically since pretty much everyone uses totally different strings (except for those that use "")
17:25	<Philip`>	But it would be useful to get much better data than this
17:26	<Lachy>	yeah, you could probably try to filter on things like the word "layout" and maybe the length (e.g. < 4 words is relatively useless)
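
A rough C sketch of the filter Lachy suggests, assuming each summary attribute value is available as a NUL-terminated string. The function names are hypothetical, the match on "layout" is case-sensitive for simplicity, and the four-word threshold is taken from the message above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Count whitespace-separated words in a string. */
static size_t count_words(const char *s) {
    size_t words = 0;
    bool in_word = false;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;
            words++;
        }
    }
    return words;
}

/* Flag a summary as probably unhelpful if it mentions "layout"
   or has fewer than four words (this includes the empty string). */
static bool summary_looks_unhelpful(const char *summary) {
    assert(summary != NULL);
    return strstr(summary, "layout") != NULL || count_words(summary) < 4;
}
```
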
17:29	Philip`	would try to do something better if he didn't have far too much urgent work to do now instead :-)
17:30	<Lachy>	are you going to release the code of the tool soon, so others can work with it?
17:31	<Philip`>	I'll attempt to do that once I have time
17:31	<Philip`>	It's not like it's particularly interesting or difficult code, though - it just downloads a load of pages into a database, then parses them all and walks through the tree trying to find things that match some condition, then sticks the results in a table
17:32	<Philip`>	(Can you get something like an XML database that does really fast queries on tree-structured data? That'd be quite handy for this kind of thing, after working around the problem that lots of sites can't be serialised into well-formed XML)
17:33	<zcorpan_>	TagSoup?
17:33	<met_>	Philip` have you some experience with xml databases?
17:34	<Philip`>	met_: None at all
17:34	<met_>	my colleagues recommended http://exist.sourceforge.net/ to me but I never tried it
17:35	<met_>	also you can use xml in postgresql (with xpath etc.), not to mention Oracle and MS SQL
17:37	<Philip`>	Ah, looks like it could be useful
17:38	<met_>	and here is a link about postgresql and xpath http://www.throwingbeans.org/postgresql_and_xml.html
17:39	<met_>	ms sql2005 and oracle (not sure which version) have it natively as xml datatypes
17:40	<Philip`>	Hopefully the databases do some kind of indexing, because running unindexed queries over 100MB of XML doesn't sound like the absolute fastest thing ever
17:41	<Philip`>	or maybe I'm thinking from the wrong perspective for this kind of thing
17:41	<met_>	ms and oracle yes
17:44	<Philip`>	(For added fun, some of my downloaded documents are actually PDF files, parsed by html5lib into something that I expect is quite hideous. Maybe I should check the content-type on these things...)
17:45	<met_>	whow
17:45	<met_>	and what other types like *.doc etc
17:46	<Philip`>	I don't see any of those
17:46	<Philip`>	I just got the URLs from Yahoo search results (since they're nicer than Google and still provide search APIs), so it's limited to what files they think are worth putting in the results
19:24	<annevk>	csarven, what about it?
19:24	<csarven>	i find <m> arbitrary but im sure <samp> has its own story
19:25	<annevk>	<samp> is just there because dropping it would have little value
19:26	<annevk>	<m> is there because lots of pages use it
19:26	<annevk>	aiui
19:26	<Philip`>	I thought HTML5 was starting from a clean slate and only adding features when there's good enough reasons to justify adding them...
19:27	<csarven>	lots of pages use lots of things =)
19:27	<Philip`>	(or at least I'm fairly sure I remember people using that as an argument)
19:27	<csarven>	Philip` that would be the ideal approach but it is not always the case
19:28	<annevk>	Philip`, in general, ye
19:28	<annevk>	s
19:55	annevk	tends to agree with David Baron that for implementors every HTML feature needs to be specified
19:55	<annevk>	(this includes <frameset>)
20:11	<hsivonen>	annevk: yeah. If you build navigation systems, you need to know that the earth is round even if a flat earth would be nicer
20:13	<Lachy>	annevk, are you referring to David's latest on www-html? I didn't get the relevance, since the discussion was related to document conformance only.
20:14	<annevk>	the contents of his e-mail are relevant imo
20:14	<annevk>	although I agree it didn't make much sense in context
20:15	<Lachy>	sure, it's relevant to the spec in general
20:15	<hsivonen>	is there now relevant discussion on www-html? I unsubscribed to respect the HTML WG email recess.
20:16	<Lachy>	hsivonen, not really
20:16	<Lachy>	I'll let you know when something important is posted
20:17	<hsivonen>	Lachy: thanks
20:18	<Lachy>	nice! I can refer to this next time someone tries to shift the burden of proof on to me to disprove their claim http://en.wikipedia.org/wiki/Burden_of_proof#Science_and_other_uses
21:19	<tantek>	Lachy, nice reference, I hadn't seen that before and ended up writing up our own for microformats.org: http://microformats.org/wiki/brainstorming#Burden_of_Proof