| 00:05 | <MikeSmith> | aleray: if that is in fact a message coming from the html5lib module and not lxml itself, gsnedders might have a clue |
| 00:07 | <MikeSmith> | aleray: a code grep indicates that it's an lxml message |
| 00:07 | <MikeSmith> | src/lxml/lxml.etree.c |
| 00:09 | <aleray> | MikeSmith, thanks. I found this solution: `html = ''.join(c for c in html if valid_xml_char_ordinal(c))` on a forum |
| 00:09 | <MikeSmith> | ah, that's a generated file |
| 00:09 | <MikeSmith> | ah OK |
| 00:09 | <aleray> | not sure if it strips anything important though |
| 00:10 | <aleray> | or just junk characters (the HTML is generated with CKEditor from Word documents) |
| 00:10 | <aleray> | Here is the link to the forum: http://www.itsprite.com/pythonfiltering-out-certain-bytes-in-python/ |
| 00:10 | MikeSmith | looks |
| 00:12 | <MikeSmith> | looks like http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python might also be useful |
| 00:12 | <aleray> | MikeSmith, same post actually. Thanks for pointing to the source though |
| 00:13 | <MikeSmith> | ah ok |
| 00:13 | <MikeSmith> | these days I always go to StackOverflow first |
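For reference, a minimal sketch of the character-filtering approach discussed above. The `valid_xml_char_ordinal` helper is the one aleray's one-liner assumes; this version simply checks the code point ranges allowed by XML 1.0, which is what the linked Stack Overflow answer does. The sample `html` value is illustrative only.

```python
def valid_xml_char_ordinal(c):
    """True if the character c is allowed by XML 1.0."""
    codepoint = ord(c)
    return (
        codepoint in (0x9, 0xA, 0xD)
        or 0x20 <= codepoint <= 0xD7FF
        or 0xE000 <= codepoint <= 0xFFFD
        or 0x10000 <= codepoint <= 0x10FFFF
    )

# Example input containing a control character that lxml would reject.
html = "<p>hello\x0bworld</p>"

# Strip disallowed characters before handing the markup to lxml/html5lib.
html = "".join(c for c in html if valid_xml_char_ordinal(c))
```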
| 00:15 | <aleray> | MikeSmith, so it seems to work. I have another small issue: I parse an HTML fragment with no root element. My code fails because it expects a root node, but I get a list of elements instead |
| 00:16 | <MikeSmith> | aleray: yeah I understand that problem there but can't be of much help just at the moment. Will be freed up in about 2 hours if you're still around |
| 00:17 | <aleray> | MikeSmith, thanks. I'll probably be sleeping by then. Any pointers so I can search for it myself? |
| 00:43 | <gsnedders> | aleray: known bug, but kinda horrible to fix accurately and hence not done yet :\ |
| 00:43 | <gsnedders> | aleray: I'm increasingly leaning towards just hacking together some horrible fix for it that should at least /mostly/ fix it |
| 01:02 | <aleray> | gsnedders, hi; are you talking about the invalid XML characters? |
| 01:03 | <aleray> | I'm glad the solution I found seems to be working, at least |
| 01:04 | <aleray> | gsnedders, maybe you could help me with the other thing, that is, parsing a fragment with lxml and getting a tree rather than a list of elements |
| 01:06 | <aleray> | the code here: http://dpaste.com/0XQHK9D raises an `AttributeError: 'list' object has no attribute 'xpath'` |
| 01:06 | <aleray> | because `tree = parser.parseFragment(html)` returns a list |
| 01:07 | <aleray> | because my fragment contains several elements |
| 01:13 | <aleray> | etree and lxml behave differently. See http://dpaste.com/3SE8103 and http://dpaste.com/3G7N7JS |
| 01:13 | <aleray> | I'd like the etree behaviour with lxml, so I could use xpath on it |
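A possible workaround for the list-of-nodes problem, sketched under the assumption (matching the dpaste above) that `parseFragment` with the lxml treebuilder returns a plain list of top-level nodes; exactly what appears in that list (e.g. whether top-level text shows up as bare strings) may vary by html5lib version, and the sample markup and variable names here are illustrative, not aleray's code.

```python
import html5lib
from lxml import etree

# A fragment with several top-level elements, as in the pasted example.
html = "<p>first</p>intervening text<p>second</p>"

# With the lxml treebuilder this returns a list of top-level nodes.
fragment = html5lib.parseFragment(html, treebuilder="lxml",
                                  namespaceHTMLElements=False)

# Wrap everything in a synthetic <div> so there is a single root to query.
root = etree.Element("div")
for node in fragment:
    if isinstance(node, str):
        # Bare text: attach it to the previous element's tail,
        # or to the wrapper's text if nothing precedes it.
        if len(root):
            root[-1].tail = (root[-1].tail or "") + node
        else:
            root.text = (root.text or "") + node
    else:
        root.append(node)

print(root.xpath(".//p"))  # xpath now works on the wrapped fragment
```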
| 08:39 | <zewt_> | "filename.zip may harm your browsing experience, so Chrome has blocked it" nice of Chrome to ask my permission before "blocking it" a harmless file that I now have to download again in firefox |
| 08:39 | <zewt_> | all browsers have turned to crap |
| 08:49 | <zewt_> | when did it become okay for browsers to override the user on their own system? It bodes deeply ill for the future of the web |
| 08:53 | <JonathanC> | https://www.w3.org/community/ - the DataSheets group needs members |
| 10:19 | <aleray> | I'm stuck with my problem from yesterday: using lxml, the parseFragment method gives me a list of nodes |
| 10:19 | <aleray> | with etree it gives me one single element |
| 10:19 | <aleray> | because I have a list, I can't use methods like "xpath" |
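If html5lib is not a hard requirement, lxml.html itself can parse a multi-element fragment and wrap it in a single parent element, giving one root to call `xpath()` on. This is only an alternative sketch; it uses lxml's own HTML parser, whose error recovery differs from html5lib's, and the sample markup is illustrative.

```python
import lxml.html

html = "<p>first</p><p>second</p>"

# create_parent wraps the top-level nodes in a single <div>, so the result
# is one element instead of a list and xpath() works on it directly.
root = lxml.html.fragment_fromstring(html, create_parent="div")
print(root.xpath(".//p"))
```

lxml also ships an `lxml.html.html5parser` module that wraps html5lib, which may be worth a look if html5lib's parsing behaviour specifically is needed.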
| 11:39 | <annevk> | 2016 is close and the Location object is still poorly understood and defined: https://lists.w3.org/Archives/Public/www-archive/2015Oct/0051.html |
| 12:16 | <gsnedders> | aleray: hmm… both options seem kinda bad :\ |
| 14:26 | <aleray> | gsnedders, hi, what do you mean? |
| 19:24 | <nox> | annevk: Should an url object's query become null again when its URLSearchParams becomes empty? |