-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add attributes dict accessor #107
Conversation
Great idea, attribute extraction is something that's not obvious indeed! Something to think about: often only a single attribute is needed. In this case users will be writing
What are use cases for I like that .attrs is implemented for SelectorList, it looks like an useful shortcut. On the other hand, we have Arguments against current API design (they don't mean current design is not the best, I'm not sure what the best design is, suggestions and ideas are welcome!):
An alternative is to have .attrib, without having both methods. Thoughts? |
Heey Mike, thank you so much for this thoughtful review! <3 It seems you captured everything I thought about, and more! About the need for something like Taking into account the things you mentioned, I'm now leaning more towards adding an So lemme update this with that change! Thanks again for reviewing! |
See discussion on: #107
Codecov Report
@@ Coverage Diff @@
## master #107 +/- ##
==========================================
+ Coverage 99.34% 99.62% +0.27%
==========================================
Files 5 5
Lines 307 266 -41
Branches 61 48 -13
==========================================
- Hits 305 265 -40
Misses 1 1
+ Partials 1 0 -1
Continue to review full report at Codecov.
|
@redapple sorry, I inadvertently requested a review clicking your name in the reviewer list -- I actually wanted to visit your profile and check out your cool new profile pic 😄 But of course, feel free to weigh in if you want :D |
Exposing .attrib looks minimal and safe, though I would probably miss something attrib-like for SelectorList. |
Hm yeah... What if we added an |
Honestly, Selector(List) api looks messy already, old and recent additions went on different directions without a common agreement.
|
A small note about I don't have a strong opinion about .attrib being None or an empty dict for empty SelectorList; you need to either handle an exception or check if SelectorList is empty. I slightly prefer an empty dict because an empty SelectorList or a SelectorList where a first element is not what you want look similar from user point of view. Also, with .extract_first it is common to use result directly as an item value, while for .attrib I expect that in most cases it'd be a lookup for a specific attribute. But None is fine as well, you just get a different exception. As for deprecating extract/extract_first, I think we should start with switching Scrapy docs to get/getall. |
Hey fellows, following up one month later... sorry for the long delay, been a bit behind in the OSS stuff lately! |
Thanks @eliasdorneles! I think the current pull request is fine, and the only unresolved question is whether attrib should be None or an empty dict for empty SelectorList objects. @dangra wrote:
So, let's check example use patterns: # --- get an attribute, raise an error if attribute is not present:
sel.css('a').attrib['href'] # in both cases
# --- get an attribute, use None as its default value, but raise an error if selector is empty:
# None
sel.css('a').attrib.get('href')
# {}, though not exactly equivalent
sel.css('a')[0].attrib.get('href')
# --- get an attribute, use None as its value when attribute
# is not present or a selector is empty:
# None
(sel.css('a').attrib or {}).get('href')
# {}
sel.css('a').get('href') Examples convinced me None is better: it allows to distinguish between non-matching selector and a first element without attributes, and is not much harder to use when the opposite behavior is required. It also looks easier to debug in cases you've got a selector wrong, there will be less questions about "element has attributes, but css('my-selector").attrib is empty". So +1 to switch to None, add an example to parsel docs, and merge. |
So, I was hoping I wouldn't need to jump into the bikeshed wagon, but here I go... :D The problem of using
with all its variants:
I would also bet that the null-check would be done after hitting the problem (possibly in production). It may not seem like a big deal, but forcing all users to add a null-check to guarantee their code won't break just because of an use-case that is likely not that common, it makes for a counter-intuitive API. Moreover, users that actually need to distinguish between empty selectors will probably check I hereby invoke the motto The main reason we're adding TL;DR: I believe that 1) always returning a dict will make it easier for users to write robust code for the most common case, and that 2) returning Does this make sense? |
FWIW.. I agree with @eliasdorneles that returning an empty dict would make a more ergonomic API, even if it provides less information to the caller (e.g., on whether the attribute or selector result was empty). But, to make things a little worse, can I ask whether attributes are guaranteed to be unique in a HTML element? Because I don't think there's anything preventing someone from doing That would mean that the usage starts to look a bit daft, though, so we're back to the question of "purity vs. practicality". Even if |
Yes, as per https://www.w3.org/TR/html5/syntax.html#elements-attributes: "There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other." Also, lxml provides attributes as a dict, we're just using its API. |
parsel/selector.py
Outdated
def attrib(self): | ||
"""Return the attributes dictionary for underlying element. | ||
""" | ||
return self.root.attrib |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This returns an lxml.etree._Attrib instance, which can lead to unforseen consequences:
- changes in this object are reflected in a tree;
- this object may keep a reference to a tree, preventing response GC;
isinstance(sel.attrib, dict) is False
What do you think about returning dict(self.root.attrib) instead? It will ensure a copy, and make output a regular Python dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oohh, good catch!
Sounds good, I'll do that in a jiffy!
I'm fine with both empty dict and None values for empty SelectorList, @eliasdorneles and @cathalgarvey convinced me that None is not better. Let's put a final decision on @dangra's shoulders :) |
This is done to prevent: - changes in this object be reflected in a tree; - keeping a reference to a tree and preventing response GC; - isinstance(sel.attrib, dict) to be False Thanks @kmike !
Thanks @eliasdorneles! Actually it was @dangra who brought this up and asked for a copy :) |
party time 🎆 |
Yaay! Thank you, fellows! 🎉 :) |
It's often useful to get the attributes of the underlying elements, and Parsel currently doesn't make it obvious how to do that.
Currently, the first instinct is to use XPath, which makes it a bit awkward because you need a trick like the
name(@*[i])
described in this blog post.This PR proposes adding two methods
.attrs()
and.attrs_all()
(mirroringget
andgetall
) for getting attributes in a way that more or less made sense to me.What do you think?