07:48 | <annevk> | smaug: Luca Casonato: see https://github.com/w3c/FileAPI/issues/43 for changing that (it shouldn't lowercase parameters as that can change meaning in certain cases, such as with boundary) |
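
A minimal sketch of the point annevk makes above: lowercasing a whole MIME type string also lowercases parameter values such as `boundary`, which are case-sensitive. The helper name `lowercaseEssence` is hypothetical and is not what the File API specifies; it only lowercases the `type/subtype` part and leaves everything from the first `;` onward untouched.

```javascript
// Hypothetical helper, not the File API algorithm: lowercase only the
// type/subtype portion and keep any parameters exactly as written.
function lowercaseEssence(mimeType) {
  const semicolon = mimeType.indexOf(";");
  if (semicolon === -1) {
    return mimeType.toLowerCase();
  }
  return mimeType.slice(0, semicolon).toLowerCase() + mimeType.slice(semicolon);
}

const input = "Multipart/Form-Data; boundary=AaB03x";

// Lowercasing everything changes the boundary value.
console.log(input.toLowerCase()); // multipart/form-data; boundary=aab03x

// Lowercasing only the essence keeps it intact.
console.log(lowercaseEssence(input)); // multipart/form-data; boundary=AaB03x
```

A body delimited with `AaB03x` cannot be split using `aab03x`, which is the kind of change in meaning being referred to.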
08:49 | <emilio> | sideshowbarker: when you're around, can you merge https://github.com/validator/htmlparser/pull/70? Henri will be afk for a couple weeks :) |
10:29 | <emilio> | sideshowbarker: if you can review https://github.com/validator/htmlparser/pull/71 as well that'd be awesome, but no worries, it's not super-urgent (though it should be trivial :P). Tests and so have been reviewed already in https://bugzilla.mozilla.org/show_bug.cgi?id=1775477 fwiw :) |
10:29 | <sideshowbarker> | will look right now |
10:30 | sideshowbarker | checks the spec |
10:31 | <sideshowbarker> | hmm yeah keygen is not in the spec any longer, I see |
10:31 | sideshowbarker | now reads https://bugzilla.mozilla.org/show_bug.cgi?id=1775477 |
10:32 | <emilio> | Yeah, all browsers agree on it being an HTMLUnknownElement except gecko when created from the parser, because we forgot to regenerate the C++ sources / ElementName.java :-( |
10:33 | <sideshowbarker> | heh :) |
10:33 | <sideshowbarker> | well, it’s merged now — so we’ll have proper interop for it |
10:33 | <sideshowbarker> | minor win but hey, it’s still good |
10:34 | <emilio> | yeah, fairly minor, but while I was at it seemed worth doing :) |
10:34 | <sideshowbarker> | yup |
13:35 | <zcorpan> | Hmm why is lists.whatwg.org incomplete? (Doesn't contain replies to the referenced email) |
13:35 | <zcorpan> | referenced in https://twitter.com/fmalina/status/1539963473488011264 |
17:22 | <Domenic> | It was created from some static backup crawl that I believe was incomplete |
17:22 | <Domenic> | https://github.com/whatwg/whatwg.org/pull/270 |
17:23 | <Domenic> | https://github.com/whatwg/meta/issues/153#issuecomment-566980200 |
20:08 | <Seirdy> | Anyone more familiar with the standards know how I should navigate this discrepancy between the WHATWG HTML standard and WAI-ARIA? |
20:16 | <Domenic> | Permalinks don't seem like <aside>s to me. Recall that aside is a "section of the page", i.e. something that traditionally has its own headings, content, etc. |
20:18 | <Seirdy> | ah I see |
20:18 | <Seirdy> | How would you mark this up? |
20:18 | <Seirdy> | I've seen many different ways of doing this but I'm not particularly fond of them |
20:18 | <Seirdy> | because they make textual-browsers, graphical-browsers, and screen readers show content that is too different (IMO) |
20:20 | <Seirdy> | my current solution (no <aside>, just a hyperlink with an aria-labelledby that includes the visible link name and the section name) actually works really well, in that all three aforementioned audiences perceive something almost identical and can navigate it pretty easily. That is, until you use literally any reading-mode implementation (hence the problem). |
20:27 | <Domenic> | I don't know how I would mark it up because I don't know the exact heuristics article extractors use to get rid of elements |
20:29 | <Domenic> | https://github.com/mozilla/readability/blob/master/Readability.js#L121 gives some potential clues... like if you put class="utility" maybe they will go away? In Readability at least? |
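
A rough sketch of the kind of class-name heuristic Domenic is pointing at. The two patterns below are made up for illustration and are not Readability's actual lists (whether `utility` appears in the real one should be checked against the linked source); the function name `looksLikeClutter` is also hypothetical.

```javascript
// Illustrative patterns only; NOT copied from Readability.js.
const UNLIKELY = /sidebar|share|utility/i;
const MAYBE_OK = /article|content|main/i;

// An element is treated as probable clutter when its class/id string matches
// the "unlikely" pattern and nothing in it suggests real content.
function looksLikeClutter(className, id = "") {
  const matchString = `${className} ${id}`;
  return UNLIKELY.test(matchString) && !MAYBE_OK.test(matchString);
}

console.log(looksLikeClutter("utility permalink")); // true
console.log(looksLikeClutter("permalink"));         // false
```

The catch is that a page author has to know which names a given extractor happens to match, which is exactly the implementation-specific knowledge being discussed.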
20:29 | <Seirdy> | hmm that's a last resort i guess |
20:30 | Seirdy | likes the perhaps-unrealistic ideal of making a website that sticks only to existing standards and doesn't have any implementation-specific quirks. |
20:30 | <Domenic> | But these tools you're trying to cater for don't have any standards governing them |
20:30 | <Seirdy> | (this is a personal site after all heh) |
20:31 | <Seirdy> | well, they sort of do |
20:31 | <Seirdy> | readability and apple reader mode at least |
20:31 | <Domenic> | Oh, I wasn't aware? Where's the spec for them? |
20:32 | <Seirdy> | they use microdata with schema.org vocab to identify articles (when that markup exists), semantic html elements like aside and article, Dublin Core and more microdata for metadata extraction, etc. |
20:32 | <Seirdy> | so i guess it's not a standard for their behavior |
20:32 | <Seirdy> | more like standards they observe |
20:32 | <Seirdy> | DOM Distiller tries to be "smart" and also removes elements with high link-density |
20:33 | <Seirdy> | that's not a standard behavior |
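
A sketch of what a link-density check could look like, to make the heuristic Seirdy mentions concrete. It is not DOM Distiller's code: nodes are modelled as plain objects with hypothetical `text` and `links` fields, and the 0.5 threshold is an assumed value.

```javascript
// Link density = characters of text inside links / characters of all text.
// `node` is a hypothetical plain object: { text: string, links: string[] }.
function linkDensity(node) {
  const total = node.text.length;
  if (total === 0) {
    return 0;
  }
  const linked = node.links.reduce((sum, linkText) => sum + linkText.length, 0);
  return linked / total;
}

// Assumed threshold; real extractors tune this differently.
function isLinkHeavy(node, threshold = 0.5) {
  return linkDensity(node) > threshold;
}

const paragraph = {
  text: "Read the spec for details on parsing.",
  links: ["the spec"],
};
const permalink = {
  text: "Permalink to this section",
  links: ["Permalink to this section"],
};

console.log(isLinkHeavy(paragraph)); // false
console.log(isLinkHeavy(permalink)); // true
```

A section permalink is all link text, so a check like this flags it regardless of how carefully it is marked up.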
20:33 | <Domenic> | Yeah, but the core behavior of stripping elements has no standard, so to the extent you want to interact with that, you're gonna have to use something nonstandard. |
20:33 | <Seirdy> | yeah i mean they identify the meanings of elements when extracting content, and they perform that identification according to existing standards |
20:33 | <Seirdy> | when they render that extracted content, that's nonstandard ofc |
20:34 | <Domenic> | It's not clear to me whether they perform the extraction before or after filtering out elements, but I take your meaning |
20:35 | <Seirdy> | honestly im probably overthinking this, but it's fun. i just find the whole concept of article-distillation so interesting :3 |
20:36 | <Seirdy> | the way you gotta balance the need to observe standards (identify elements: what is an article and what isn't?) but still break them (remove parts of a page) presents a really fascinating problem. |
20:36 | <Domenic> | Yeah, I find it pretty interesting too. It's unfortunately at that intersection of technologies like "semantic web" where, if everyone did perfect markup + maybe we made a couple more specs, we could have a nice result. But 95% of people won't do perfect markup so heuristics rule. |
20:36 | <Seirdy> | well it's a combination |
20:37 | <Seirdy> | for instance, they use heuristics to determine if reader mode should even be presented on a page |
20:37 | <Seirdy> | those heuristics are generated by an ML program |
20:37 | <Domenic> | E.g. reader mode won't even trigger in Firefox for any WHATWG specs because they are at the root of their domain and Firefox has a heuristic to not allow reader mode on root domains. |
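
The root-of-domain rule Domenic describes can be sketched as a path check. This is an illustration of the idea, not Firefox's actual code; the function name `isDomainRoot` is hypothetical.

```javascript
// Hypothetical check: treat a URL as "the root of its domain" when its path is "/".
function isDomainRoot(urlString) {
  const { pathname } = new URL(urlString);
  return pathname === "/";
}

console.log(isDomainRoot("https://html.spec.whatwg.org/"));             // true
console.log(isDomainRoot("https://example.org/blog/some-article.html")); // false
```

Under a rule like this a single-page spec served from `/` is never offered reader mode, however article-like its markup is.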
20:37 | <Seirdy> | (Fathom on mozilla's side; check out #fathom:mozilla.org) |
20:39 | <Seirdy> | but when it comes to the actual article-distillation process on a page that has decent markup, Readability is quite predictable. It observes existing standards quite well. |
20:39 | <Domenic> | I wonder if an ML model would do better at article extraction than existing code like readability. If you had enough training data... but I do not know how much is "enough" |
20:39 | <Domenic> | I don't find https://github.com/mozilla/readability/blob/master/Readability.js#L2062-L2142 very predictable tbh :) |
20:40 | <Domenic> | (!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure")) as just one line... |
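
To make the quoted condition easier to read, here it is unpacked into named clauses. This only restates the expression from the message above; the inputs are passed as a plain object, `insideFigure` stands in for `this._hasAncestorTag(node, "figure")`, and the function and clause names are hypothetical. What Readability does with the result is outside this sketch.

```javascript
// Restates the quoted one-liner with each clause named.
function matchesQuotedCondition({ isList, headingDensity, contentLength, img, insideFigure }) {
  const notAList = !isList;
  const notMostlyHeadings = headingDensity < 0.9;
  const veryShortText = contentLength < 25;
  const noImagesOrMoreThanTwo = img === 0 || img > 2;
  const notInsideFigure = !insideFigure;

  return notAList && notMostlyHeadings && veryShortText && noImagesOrMoreThanTwo && notInsideFigure;
}

// A short, image-free block that is neither a list nor inside a <figure>.
console.log(matchesQuotedCondition({
  isList: false,
  headingDensity: 0.1,
  contentLength: 10,
  img: 0,
  insideFigure: false,
})); // true
```

Five independent thresholds in one expression is what makes the outcome hard to predict from the markup alone.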
20:41 | <Seirdy> | Domenic: DOM Distiller uses a weird-ass hybrid approach that combines multiple different extraction algos and uses heuristics to determine which algo to use. The algos were mostly written by humans, but they used the outputs of some machine learning programs as reference. And DOM-Distiller is a Java program transpiled to JavaScript. |
20:41 | <Seirdy> | all that complexity screams "Google project" to me, heh |
20:42 | <Seirdy> | ironically the article-extractor made by an adtech company (Google) is the most aggressive at removing sponsored content (link density heuristics give false positives). |
20:43 | <Seirdy> | also the version built into chromium is only available via chrome://flags, and the flag expired in m105 so 🤷 |
20:46 | <Seirdy> | would be nice if reading-modes had a "simple" and "aggressive" mode. where the simple mode errs towards "trust the page" and strictly observes standards when identifying elements, while the "aggressive" mode assumes the page has bad or deceitful markup to sneak in some ads and whatnot. |
20:46 | <Seirdy> | and users could switch between them |