Add attributes dict accessor #107

eliasdorneles · 2018-01-05T10:38:17Z

It's often useful to get the attributes of the underlying elements, and Parsel currently doesn't make it obvious how to do that.

Currently, the first instinct is to use XPath, which makes it a bit awkward because you need a trick like the name(@*[i]) described in this blog post.

This PR proposes adding two methods .attrs() and .attrs_all() (mirroring get and getall) for getting attributes in a way that more or less made sense to me.
What do you think?

kmike · 2018-01-15T15:12:27Z

Great idea, attribute extraction is something that's not obvious indeed!

Something to think about: often only a single attribute is needed. In this case users will be writing sel.attrs()['class']. It is tempting to allow it to be an argument: sel.attrs('class'). But in this case attrs name is not good, as a single attribute is returned. A middle-ground could be sel.attrib('class'), which works both for a single and multiple arguments, but it can be confused with .attrib attribute of lxml node, which is just a dict, not a function. FWIW, BeautifulSoup also provides .attrib attribute, like lxml (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes).

sel.attrs_all() looks confusing - based on its name, I'd expect it to return all attributes of an element.

What are use cases for attrs_all methods? I can see how it is useful e.g. for link extraction, but in this case it is better to have attributes only for elements with an attribute of interest. It can also be useful when combining .getall() results with attributes though.

I like that .attrs is implemented for SelectorList, it looks like an useful shortcut. On the other hand, we have .extract() which follows a different logic: a single value is returned for Selector, but a list of values is returned for SelectorList, and then a method with a different name (.get / .extract_first) as a shortcut to get a first value. If we follow this design, .attrs() should have a different return type, and there could be .attr_first() method for SelectorList. But that's a moot point, as .get() was created to change this.

Arguments against current API design (they don't mean current design is not the best, I'm not sure what the best design is, suggestions and ideas are welcome!):

Selector.attrs() -> lxml and BS provide .attrib attribute, which allows to select a single attribute with less effort
SelectorList.attrs() -> .extract() works in a different way.
Selector.attrs_all() -> it doesn't look useful, added only for consistency;
SelectorList.attrs_all() -> a different logic can be more helpful for some use cases like img src extraction

An alternative is to have .attrib, without having both methods. attrs_all looks easy to implement in user code, and less useful than .extract(); .attrib attribute makes code more similar to lxml and BS, and also allow to select specific attribute values with less effort. On the other hand, currently Selector instances don't have public attributes (.root is not documented), and it SelectorList.attrib makes less sense. Providing this attribute also would make introducing attrs_all method (or alike) later more problematic, as there won't be a parallel with .get / .getall.

Thoughts?

eliasdorneles · 2018-01-21T01:14:15Z

Heey Mike, thank you so much for this thoughtful review! <3

It seems you captured everything I thought about, and more!

About the need for something like attrs_all: some time ago I was using Parsel to explore an SVG file and wanted to get all the attribute names used in every element -- it took me a while because I didn't know about the .root.attrib trick and ended up using the XPath position trick. :)
To be honest, I didn't give THAT much thought about it.

Taking into account the things you mentioned, I'm now leaning more towards adding an .attrib element to Selector only, has the niceties of external consistency with lxml and BeautifulSoup, therefore probably being more like what an user would expect.
Plus, as you said, the attrs_all logic is easy to implement in user code.

So lemme update this with that change!

Thanks again for reviewing!

See discussion on: #107

codecov · 2018-01-21T01:30:51Z

Codecov Report

Merging #107 into master will increase coverage by 0.27%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #107      +/-   ##
==========================================
+ Coverage   99.34%   99.62%   +0.27%     
==========================================
  Files           5        5              
  Lines         307      266      -41     
  Branches       61       48      -13     
==========================================
- Hits          305      265      -40     
  Misses          1        1              
+ Partials        1        0       -1

Impacted Files	Coverage Δ
parsel/selector.py	`100% <100%> (+1.07%)`	⬆️
parsel/csstranslator.py	`98.57% <0%> (-1.43%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 58c85ca...ee042d1. Read the comment docs.

eliasdorneles · 2018-01-21T01:37:13Z

@redapple sorry, I inadvertently requested a review clicking your name in the reviewer list -- I actually wanted to visit your profile and check out your cool new profile pic 😄

But of course, feel free to weigh in if you want :D

kmike · 2018-01-23T12:58:48Z

Exposing .attrib looks minimal and safe, though I would probably miss something attrib-like for SelectorList.

eliasdorneles · 2018-01-23T14:37:06Z

Hm yeah...

What if we added an .attrib to SelectorList -- should we 1) return only for the first element or 2) do it like extract and return a list?
I lean more into option 1, because it reflects the behavior of XPath of converting a nodeset into a node and also because I think it's weirder to have different return types than ignoring the next sibling elements.

kmike · 2018-01-23T15:26:15Z

+1 to return .attrib for a first element in this case, or an empty dict if SelectorList is empty.

//cc @dangra @redapple @lopuhin are you OK with this design?

dangra · 2018-01-25T04:39:11Z

+1 to return .attrib for a first element in this case, or an empty dict if SelectorList is empty.
//cc @dangra @redapple @lopuhin are you OK with this design?

Honestly, Selector(List) api looks messy already, old and recent additions went on different directions without a common agreement.

`extract()` and `extract_first()` methods

Selector.extract() returns string.
Selector.extract_first() doesn't exist.
SelectorList.extract() returns list of strings.
SelectorList.extract_first() returns string or default=None if empty list.

== .extract() was part of the original API.
-- It returns different types based on parent type.
-- Selector and SelectorList interface is different (no Selector.extract_first())
-- the pattern didn't scale to new features like extract_first() that in retrospective are a clear mistake.

`re()` and `re_first()` methods

Selector.re() returns list of strings.
Selector.re_first() returns string or default=None if doesn't match.
SelectorList.re() returns list of strings.
SelectorList.re_first() returns string or default=None if doesn't match.

== re()/re_first() are "perfect"
-- re() is not a pure selector method, it does post extraction filtering plus processing (replace entities).
++ they returns same type and same meaning no matter if called for Selector or SelectorList.
++ They also offer same method signature for Selector and SelectorList.
== they could be called re/reall() following get/getall() pattern except it breaks backwards compatibility.
== Or deprecate re/re_first() methods and extend get/getall() signatures with regex= parameter.

`get()` and `getall()` methods

Selector.get() returns string. It doesn't accept default parameter.
Selector.getall() returns list of strings.
SelectorList.get() returns string or default=None if empty list.
SelectorList.getall() returns list of strings.

++ Same return types across classes.
-- .get() signature is different for Selector and SelectorList

New `attrib` property

Selector.attrib returns dict
SelectorList.attrib returns dict or empty dict if empty list.

== Handy shortcut to access attributes
++ Coherent with lxml and BS interfaces
++ Coherent with get() and clear path to add attriball if needed
-- Not possible to distinguish between not matching node and node with no attributes
-- As it is a property so we can't control default value for above case.

Keeping the discussion around .attrib property, I think the only thing that worries me is the last point about caller not being able to distinguish between non-matching node and node without attributes.

For the other part of the API we probably need a document to discuss on. My starting proposal would be to deprecated extract*() and re*(). Absorb regex matching and replace_entities as get*() arguments and add .text shortcut much like .attrib.

kmike · 2018-01-25T13:42:29Z

A small note about .text shortcut: I'd prefer it to be a method, because there are many ways to extract text from html snippet (see also: #34).

I don't have a strong opinion about .attrib being None or an empty dict for empty SelectorList; you need to either handle an exception or check if SelectorList is empty. I slightly prefer an empty dict because an empty SelectorList or a SelectorList where a first element is not what you want look similar from user point of view. Also, with .extract_first it is common to use result directly as an item value, while for .attrib I expect that in most cases it'd be a lookup for a specific attribute. But None is fine as well, you just get a different exception.

As for deprecating extract/extract_first, I think we should start with switching Scrapy docs to get/getall.

eliasdorneles · 2018-02-26T01:22:04Z

Hey fellows, following up one month later... sorry for the long delay, been a bit behind in the OSS stuff lately!
So I've added the .attrib for SelectorList, does this look good now?
Thanks!

kmike · 2018-02-26T11:03:50Z

Thanks @eliasdorneles! I think the current pull request is fine, and the only unresolved question is whether attrib should be None or an empty dict for empty SelectorList objects.

@dangra wrote:

Keeping the discussion around .attrib property, I think the only thing that worries me is the last point about caller not being able to distinguish between non-matching node and node without attributes.

So, let's check example use patterns:

# --- get an attribute, raise an error if attribute is not present:
sel.css('a').attrib['href']  # in both cases

# --- get an attribute, use None as its default value, but raise an error if selector is empty:

# None
sel.css('a').attrib.get('href')

# {}, though not exactly equivalent
sel.css('a')[0].attrib.get('href')

# --- get an attribute, use None as its value when attribute 
# is not present or a selector is empty:

# None
(sel.css('a').attrib or {}).get('href')

# {}
sel.css('a').get('href')

Examples convinced me None is better: it allows to distinguish between non-matching selector and a first element without attributes, and is not much harder to use when the opposite behavior is required. It also looks easier to debug in cases you've got a selector wrong, there will be less questions about "element has attributes, but css('my-selector").attrib is empty".

So +1 to switch to None, add an example to parsel docs, and merge.

eliasdorneles · 2018-02-26T14:14:13Z

So, I was hoping I wouldn't need to jump into the bikeshed wagon, but here I go... :D

The problem of using None is that it will give rise to this kind of code:

if sel.css('a').attrib:
    something = sel.css('a').attrib['something']

with all its variants:

something = sel.css('a').attrib['something] if sel.css('a').attrib else None
something = (sel.css('a').attrib or {}).get('something')

I would also bet that the null-check would be done after hitting the problem (possibly in production).

It may not seem like a big deal, but forcing all users to add a null-check to guarantee their code won't break just because of an use-case that is likely not that common, it makes for a counter-intuitive API. Moreover, users that actually need to distinguish between empty selectors will probably check sel.css('a') already anyway.

I hereby invoke the motto practicality beats purity. :)

The main reason we're adding attrib to SelectorList is for it to be a shortcut, which means we're already thinking "practicality beats purity" -- we wouldn't want to return attributes for a list if we were caring more about the type consistency than the usability.

TL;DR: I believe that 1) always returning a dict will make it easier for users to write robust code for the most common case, and that 2) returning None will not make the resulting API any cleaner (nor dirtier, either) and is more prone to errors.

Does this make sense?

ghost · 2018-02-27T13:00:04Z

FWIW.. I agree with @eliasdorneles that returning an empty dict would make a more ergonomic API, even if it provides less information to the caller (e.g., on whether the attribute or selector result was empty).

But, to make things a little worse, can I ask whether attributes are guaranteed to be unique in a HTML element? Because I don't think there's anything preventing someone from doing <div foo="bar" foo="baz">, right? So should the API return an empty defaultdict(list), not just an empty dict?

That would mean that the usage starts to look a bit daft, though, so we're back to the question of "purity vs. practicality". Even if sel.attrib['foo'] == ['bar', 'baz'] would be more correct, it would make the vastly-more-common usages more complex.

kmike · 2018-02-27T13:02:52Z

But, to make things a little worse, can I ask whether attributes are guaranteed to be unique in a HTML element?

Yes, as per https://www.w3.org/TR/html5/syntax.html#elements-attributes: "There must never be two or more attributes on the same start tag whose names are an ASCII case-insensitive match for each other." Also, lxml provides attributes as a dict, we're just using its API.

kmike · 2018-06-07T23:23:47Z

parsel/selector.py

+    def attrib(self):
+        """Return the attributes dictionary for underlying element.
+        """
+        return self.root.attrib


This returns an lxml.etree._Attrib instance, which can lead to unforseen consequences:

changes in this object are reflected in a tree;

this object may keep a reference to a tree, preventing response GC;

isinstance(sel.attrib, dict) is False

What do you think about returning dict(self.root.attrib) instead? It will ensure a copy, and make output a regular Python dictionary.

Oohh, good catch!
Sounds good, I'll do that in a jiffy!

kmike · 2018-06-07T23:33:59Z

I'm fine with both empty dict and None values for empty SelectorList, @eliasdorneles and @cathalgarvey convinced me that None is not better. Let's put a final decision on @dangra's shoulders :)

@kmike

This is done to prevent: - changes in this object be reflected in a tree; - keeping a reference to a tree and preventing response GC; - isinstance(sel.attrib, dict) to be False Thanks @kmike !

kmike · 2018-06-08T08:53:59Z

Thanks @eliasdorneles! Actually it was @dangra who brought this up and asked for a copy :)

dangra · 2018-06-08T13:35:40Z

party time 🎆

eliasdorneles · 2018-06-08T15:52:50Z

Yaay! Thank you, fellows! 🎉 :)

add attrs() and attrs_all() methods

6bd0054

change attribute accessor approach to use .attrib instead

c39553b

See discussion on: #107

eliasdorneles changed the title ~~Add attrs() and attrs_all() methods~~ Add attributes dict accessor Jan 21, 2018

eliasdorneles requested review from redapple and removed request for redapple January 21, 2018 01:33

eliasdorneles added 2 commits February 26, 2018 02:15

add .attrib for SelectorList

0eb7a45

add test for empty attributes cases

ec6d3cb

kmike reviewed Jun 7, 2018

View reviewed changes

return dict copy on sel.attrib

ee042d1

This is done to prevent: - changes in this object be reflected in a tree; - keeping a reference to a tree and preventing response GC; - isinstance(sel.attrib, dict) to be False Thanks @kmike !

kmike approved these changes Jun 8, 2018

View reviewed changes

dangra merged commit c8ff59b into master Jun 8, 2018

dangra deleted the add-attrs branch June 8, 2018 13:35

kmike added a commit that referenced this pull request Jun 21, 2018

DOC explain new .attrib property (a follow-up to #107).

1adf454

kmike added this to the v1.5 milestone Jun 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add attributes dict accessor #107

Add attributes dict accessor #107

eliasdorneles commented Jan 5, 2018

kmike commented Jan 15, 2018 •

edited

Loading

eliasdorneles commented Jan 21, 2018

codecov bot commented Jan 21, 2018 •

edited

Loading

eliasdorneles commented Jan 21, 2018

kmike commented Jan 23, 2018 •

edited

Loading

eliasdorneles commented Jan 23, 2018

kmike commented Jan 23, 2018

dangra commented Jan 25, 2018

kmike commented Jan 25, 2018

eliasdorneles commented Feb 26, 2018

kmike commented Feb 26, 2018 •

edited

Loading

eliasdorneles commented Feb 26, 2018

ghost commented Feb 27, 2018

kmike commented Feb 27, 2018 •

edited

Loading

kmike Jun 7, 2018

eliasdorneles Jun 8, 2018

kmike commented Jun 7, 2018 •

edited

Loading

kmike commented Jun 8, 2018

dangra commented Jun 8, 2018

eliasdorneles commented Jun 8, 2018

Add attributes dict accessor #107

Add attributes dict accessor #107

Conversation

eliasdorneles commented Jan 5, 2018

kmike commented Jan 15, 2018 • edited Loading

eliasdorneles commented Jan 21, 2018

codecov bot commented Jan 21, 2018 • edited Loading

Codecov Report

eliasdorneles commented Jan 21, 2018

kmike commented Jan 23, 2018 • edited Loading

eliasdorneles commented Jan 23, 2018

kmike commented Jan 23, 2018

dangra commented Jan 25, 2018

extract() and extract_first() methods

re() and re_first() methods

get() and getall() methods

New attrib property

kmike commented Jan 25, 2018

eliasdorneles commented Feb 26, 2018

kmike commented Feb 26, 2018 • edited Loading

eliasdorneles commented Feb 26, 2018

ghost commented Feb 27, 2018

kmike commented Feb 27, 2018 • edited Loading

kmike Jun 7, 2018

Choose a reason for hiding this comment

eliasdorneles Jun 8, 2018

Choose a reason for hiding this comment

kmike commented Jun 7, 2018 • edited Loading

kmike commented Jun 8, 2018

dangra commented Jun 8, 2018

eliasdorneles commented Jun 8, 2018

kmike commented Jan 15, 2018 •

edited

Loading

codecov bot commented Jan 21, 2018 •

edited

Loading

kmike commented Jan 23, 2018 •

edited

Loading

`extract()` and `extract_first()` methods

`re()` and `re_first()` methods

`get()` and `getall()` methods

New `attrib` property

kmike commented Feb 26, 2018 •

edited

Loading

kmike commented Feb 27, 2018 •

edited

Loading

kmike commented Jun 7, 2018 •

edited

Loading