Added text_content() method to selectors. #34
Conversation
It's useful to extract the text contents of HTML nodes as plain old strings, ignoring nested tags and extra spaces.
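Roughly, this is the kind of helper the PR proposes (a hypothetical sketch for illustration; the actual implementation in the PR may differ):

```python
import re

def text_content(selector):
    """Return a selector's text with nested tags flattened and whitespace collapsed.

    Assumes `selector.root` is an lxml element, as it is for parsel selectors.
    """
    text = ''.join(selector.root.itertext())  # concatenate all descendant text nodes
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace
```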
hi @paulo-raca,
I would prefer that people get to know XPath string functions better.
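For context, this is what the XPath-only approach looks like with parsel's public API (the sample markup is made up):

```python
from parsel import Selector

sel = Selector(text='<p>Hello   <b>world</b>! </p>')

# XPath's built-in string functions already cover the common case:
# string() flattens nested tags, normalize-space() collapses whitespace.
print(sel.xpath('normalize-space(string(//p))').get())  # 'Hello world!'
```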
I agree 100%. Also, I'd prefer it not have too many shortcut methods; I don't think parsel's goal is to reinvent XPath through a collection of shortcuts.
See also the discussion at scrapy/scrapy#772. I like the idea of providing customizable html2text-like features in parsel; this task is very common. And it is not only about XPath: users may want to extract text knowing only CSS selectors (XPath is not required to use parsel), and not all transformations are easily expressed in XPath. I don't think parsel can depend on https://github.com/Alir3z4/html2text/ for this, though, because html2text is GPL.

IMHO a .text() method should be a 'best effort' way to convert HTML to plain text, with some options (e.g. how to join parts? preserve newlines or not? normalize whitespace or not? handle paragraphs or not?). If we want to do that, it is better to implement the 'smarter' features from the beginning, because a bare-bones version's output would be hard to change later without breaking users. There is no default implementation which will satisfy everyone, so it is fine to be opinionated. I'd love to have something quick which works better than the naive approach.
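As a rough illustration of the kind of options being discussed (the function and parameter names here are hypothetical, not parsel API):

```python
def best_effort_text(root, separator=' ', normalize_whitespace=True):
    """Best-effort conversion of an lxml element to plain text.

    `root` is an lxml element; both options are illustrative only.
    """
    # Join all descendant text nodes with the chosen separator.
    text = separator.join(root.itertext())
    if normalize_whitespace:
        # Collapse runs of whitespace into single spaces.
        text = ' '.join(text.split())
    return text
```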
lxml has an HTML cleaner.
IMHO, a new method would be handy, particularly with some extra params (normalization, etc.). But I'm OK with telling people to use XPath instead of bloating the API. If that is the way you want to go, I think XPath's string functions should at least be better documented.
Note for the future: this implementation returns text inside <script> and <style> elements, too.
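A quick demonstration of that gotcha (the markup is invented) — naive text extraction also returns the contents of <style> and <script> elements:

```python
from parsel import Selector

sel = Selector(text='<div><style>p {color: red}</style><p>Hi</p></div>')

# The style rules leak into the extracted text:
print(sel.xpath('//div//text()').getall())  # ['p {color: red}', 'Hi']
```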
lxml provides a well performing built-in for HTML sanitization:

```python
from lxml.html.clean import Cleaner

clean_html = Cleaner(style=True).clean_html
```

(style=False seems to be the only non-sane default.)

HTML sanitization is a different topic.
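Still, for reference, a usage sketch of the Cleaner shortcut above; `clean_html` accepts a string and returns one (the sample markup is invented):

```python
from lxml.html.clean import Cleaner

cleaner = Cleaner(style=True)  # scripts are removed by default; style=True also drops <style>
print(cleaner.clean_html('<div><style>p {}</style><p>Hi</p></div>'))
# -> '<div><p>Hi</p></div>'
```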
Yes, it requires parsing again, but when using small fragments (the parts to be extracted from a page) the overhead should be small. It's the endless gotchas when using HTML fragments that make it non-straightforward: the resulting tree is unpredictable, and whitespace is so inconsistent that it needs normalization anyway. The hacks it takes to implement this are only a few lines, though. My point in my earlier comments is that getting the text content without sanitization first is not enough.
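A sketch of the "parse again" pipeline under discussion, assuming plain lxml (the helper name and sample fragment are illustrative):

```python
import lxml.html
from lxml.html.clean import Cleaner

def fragment_to_text(fragment):
    """Sanitize an HTML fragment, re-parse it, and return normalized text."""
    clean = Cleaner(style=True).clean_html(fragment)
    tree = lxml.html.fromstring(clean)
    # text_content() flattens nested tags; split/join normalizes whitespace.
    return ' '.join(tree.text_content().split())

print(fragment_to_text('<div><script>x=1</script><p>Hello  <b>world</b></p></div>'))
# -> 'Hello world'
```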
Instead of creating new methods, why not let us specify which serialization method `lxml.etree.tostring` should use? Something like the following:

```python
# module-level imports in parsel:
from lxml import etree
import six


def extract(self, tostring_method=None):
    """
    Serialize and return the matched nodes in a single unicode string.
    Percent encoded content is unquoted.
    """
    try:
        return etree.tostring(self.root,
                              # 'xml', 'html' or 'text', falling back to the
                              # selector's default serialization method
                              method=tostring_method or self._tostring_method,
                              encoding='unicode',
                              with_tail=False)
    except (AttributeError, TypeError):
        if self.root is True:
            return u'1'
        elif self.root is False:
            return u'0'
        else:
            return six.text_type(self.root)
```
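For reference, this is what `method='text'` does with a plain lxml element (a minimal sketch, not part of the proposal itself):

```python
from lxml import etree

root = etree.fromstring('<p>Hello <b>world</b>!</p>')

# method='text' serializes only the text content, dropping all tags.
print(etree.tostring(root, method='text', encoding='unicode'))  # 'Hello world!'
```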
@paulo-raca Are you OK with closing this pull request and continuing the discussion in #127?
Sure 👍 |
I recently needed to extract the text contents of HTML nodes as plain old strings, ignoring nested tags and extra spaces.
While that wasn't hard, it is a common operation that should be built into scrapy.