WHATWG on 2022-06-23

07:48	<annevk>	smaug: Luca Casonato: see https://github.com/w3c/FileAPI/issues/43 for changing that (it shouldn't lowercase parameters as that can change meaning in certain cases, such as with `boundary`)
08:49	<emilio>	sideshowbarker: when you're around, can you merge https://github.com/validator/htmlparser/pull/70? Henri will be afk for a couple weeks :)
10:26	<sideshowbarker>	emilio: yup, just now merged
10:29	<emilio>	sideshowbarker: if you can review https://github.com/validator/htmlparser/pull/71 as well that'd be awesome, but no worries, it's not super-urgent (though it should be trivial :P). Tests and such have been reviewed already in https://bugzilla.mozilla.org/show_bug.cgi?id=1775477 fwiw :)
10:29	<sideshowbarker>	will look right now
10:30	sideshowbarker	checks the spec
10:31	<sideshowbarker>	hmm yeah `keygen` is not in the spec any longer, I see
10:31	sideshowbarker	now reads https://bugzilla.mozilla.org/show_bug.cgi?id=1775477
10:32	<emilio>	Yeah, all browsers agree on it being an HTMLUnknownElement except gecko when created from the parser, because we forgot to regenerate the C++ sources / ElementName.java :-(
10:33	<sideshowbarker>	heh :)
10:33	<sideshowbarker>	well, it’s merged now — so we’ll have proper interop for it
10:33	<sideshowbarker>	minor win but hey, it’s still good
10:34	<emilio>	yeah, fairly minor, but while I was at it seemed worth doing :)
10:34	<sideshowbarker>	yup
13:35	<zcorpan>	Hmm why is lists.whatwg.org incomplete? (Doesn't contain replies to the referenced email)
13:35	<zcorpan>	referenced in https://twitter.com/fmalina/status/1539963473488011264
17:22	<Domenic>	It was created from some static backup crawl that I believe was incomplete
17:22	<Domenic>	https://github.com/whatwg/whatwg.org/pull/270
17:23	<Domenic>	https://github.com/whatwg/meta/issues/153#issuecomment-566980200
20:08	<Seirdy>	I think that the discrepancy between the HTML standard's `aside` element and WAI-ARIA's `complementary` role warrants breaking the first rule of ARIA, but I'm just a novice so i thought i should ask first: In articles on my website, I have section permalinks following each `<h2>` (excluding those in sections with `role="doc-preface"`, since those aren't in the TOC). I want these permalinks to look more or less the same whether you use a textual browser (e.g. lynx) or a graphical browser, and to not be radically different to screen reader users. Here's an example article. Problem: some article extractors ("reader modes") don't handle these section permalinks well. DOM Distiller (Chromium) completely removes the headings and permalinks; Readability (Firefox, Brave), Safari reading mode, and Immersive Reader (used in Edge) include the permalinks; Trafilatura sometimes combines the headings and permalinks without any whitespace between them. Ideally, a reading-mode implementation would just remove the section permalinks but keep the headings; none of them do this. One change that would fix all these reading-modes would be wrapping the section permalinks in an `aside` element. That seems in line with the HTML Standard (ancillary content that a reading-mode can exclude), but doesn't seem to be in line with WAI-ARIA: I don't want to create a `complementary` landmark for each section permalink! I could override the `aside` elements' implicit roles with an explicit `role` attribute, but I want to be careful; I've never broken the first rule of ARIA before, so I don't know if this is the "right place" to do so. Anyone more familiar with the standards know how I should navigate this discrepancy between the WHATWG HTML standard and WAI-ARIA?
20:16	<Domenic>	Permalinks don't seem like <aside>s to me. Recall that aside is a "section of the page", i.e. something that traditionally has its own headings, content, etc.
20:18	<Seirdy>	ah I see
20:18	<Seirdy>	How would you mark this up?
20:18	<Seirdy>	I've seen many different ways of doing this but I'm not particularly fond of them
20:18	<Seirdy>	because they make textual-browsers, graphical-browsers, and screen readers show content that is too different (IMO)
20:20	<Seirdy>	my current solution (no `<aside>`, just a hyperlink with an `aria-labelledby` that includes the visible link name and the section name) actually works really well, in that all three aforementioned audiences perceive something almost identical and can navigate it pretty easily. That is, until you use literally any reading-mode implementation (hence the problem).
20:27	<Domenic>	I don't know how I would mark it up because I don't know the exact heuristics article extractors use to get rid of elements
20:29	<Domenic>	https://github.com/mozilla/readability/blob/master/Readability.js#L121 gives some potential clues... like if you put class="utility" maybe they will go away? In Readability at least?
20:29	<Seirdy>	hmm that's a last resort i guess
20:30	Seirdy	likes the perhaps-unrealistic ideal of making a website that sticks only to existing standards and doesn't have any implementation-specific quirks.
20:30	<Domenic>	But these tools you're trying to cater for don't have any standards governing them
20:30	<Seirdy>	(this is a personal site after all heh)
20:31	<Seirdy>	well, they sort of do
20:31	<Seirdy>	readability and apple reader mode at least
20:31	<Domenic>	Oh, I wasn't aware? Where's the spec for them?
20:32	<Seirdy>	they use microdata with schema.org vocab to identify articles (when that markup exists), semantic html elements like `aside` and `article`, Dublin Core and more microdata for metadata extraction, etc.
20:32	<Seirdy>	so i guess it's not a standard for their behavior
20:32	<Seirdy>	more like standards they observe
20:32	<Seirdy>	DOM Distiller tries to be "smart" and also removes elements with high link-density
20:33	<Seirdy>	that's not a standard behavior
20:33	<Domenic>	Yeah, but the core behavior of stripping elements has no standard, so to the extent you want to interact with that, you're gonna have to use something nonstandard.
20:33	<Seirdy>	yeah i mean they identify the meanings of elements when extracting content, and they perform that identification according to existing standards
20:33	<Seirdy>	when they render that extracted content, that's nonstandard ofc
20:34	<Domenic>	It's not clear to me whether they perform the extraction before or after filtering out elements, but I take your meaning
20:35	<Seirdy>	honestly im probably overthinking this, but it's fun. i just find the whole concept of article-distillation so interesting :3
20:36	<Seirdy>	the way you gotta balance the need to observe standards (identify elements: what is an article and what isn't?) but still break them (remove parts of a page) presents a really fascinating problem.
20:36	<Domenic>	Yeah, I find it pretty interesting too. It's unfortunately at that intersection of technologies like "semantic web" where, if everyone did perfect markup + maybe we made a couple more specs, we could have a nice result. But 95% of people won't do perfect markup so heuristics rule.
20:36	<Seirdy>	well it's a combination
20:37	<Seirdy>	for instance, they use heuristics to determine if reader mode should even be presented on a page
20:37	<Seirdy>	those heuristics are generated by an ML program
20:37	<Domenic>	E.g. reader mode won't even trigger in Firefox for any WHATWG specs because they are at the root of their domain and Firefox has a heuristic to not allow reader mode on root domains.
20:37	<Seirdy>	(Fathom on mozilla's side; check out #fathom:mozilla.org)
20:39	<Seirdy>	but when it comes to the actual article-distillation process on a page that has decent markup, Readability is quite predictable. It observes existing standards quite well.
20:39	<Domenic>	I wonder if an ML model would do better at article extraction than existing code like readability. If you had enough training data... but I do not know how much is "enough"
20:39	<Domenic>	I don't find https://github.com/mozilla/readability/blob/master/Readability.js#L2062-L2142 very predictable tbh :)
20:40	<Domenic>	`(!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure"))` as just one line...
20:41	<Seirdy>	Domenic: DOM Distiller uses a weird-ass hybrid approach that combines multiple different extraction algos and uses heuristics to determine which algo to use. The algos were mostly written by humans, but they used the outputs of some machine learning programs as reference. And DOM-Distiller is a Java program transpiled to JavaScript.
20:41	<Seirdy>	all that complexity screams "Google project" to me, heh
20:42	<Seirdy>	ironically the article-extractor made by an adtech company (Google) is the most aggressive at removing sponsored content (link density heuristics give false positives).
20:43	<Seirdy>	also the version built into chromium is only available via chrome://flags, and the flag expired in m105 so 🤷
20:45	<Seirdy>	Domenic: oh true, re: those Readability.js lines not being very predictable
20:46	<Seirdy>	would be nice if reading-modes had a "simple" and "aggressive" mode. where the simple mode errs towards "trust the page" and strictly observes standards when identifying elements, while the "aggressive" mode assumes the page has bad or deceitful markup to sneak in some ads and whatnot.
20:46	<Seirdy>	and users could switch between them