Date: | 2020-05-26 |
---|---|
Author: | Gábor Nyers |
tags: | python |
category: | python_workshop |
summary: | The Iterator protocol and its practical uses |
licence: | CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/ |
Contents:
The next Python Tuesday session is all about how iterators help us to solve many data-processing related problems. The term 'iterator' may sound abstract but there is a practical benefit and they everywhere in Python3. For example the functions like sum(), min(), max() or the much more powerful sorted(), filter() and map() functions.
- Introduction (what is the "Iterator Protocol" and do we care?)
- A bit of theory: Concepts
- Tools and data types
- Use cases
- As a Python programmer you rely on the Iterator protocol every time that
you:
- process the elements of a container data type in a
for
loop, e.g.:list
,dict
- sort the elements of a container data type
- use any of the following functions:
min()
,max()
,sum()
,map()
,sorted()
,filter()
- process the elements of a container data type in a
- The working of the above facilities is possible because the involved components work in concert according the "Itereator Protocol"
Fundamental task: process every element of a container, e.g.:
>>> s = 'Beautiful is better than ugly.' # str, with 30 elements >>> l = [ 1, 2, 92, -32, 'python', 3.1415 ] # list, 6 elements of 3 types >>> t = ( 3, 9, 'cheese shop', False ) # tuple >>> d = { 'alice': 32, 'bob': 42 } # dict >>> s = { 3, 3, 3, 4, -1, True, 94 } # set >>> b = b'\xe2\x82\xac1 = \xc6\x922.20' # bytes ('€1 = ƒ2.20') >>> ab = Addressbook( Person('Alice'), \ # an custom `Addressbook` ... Person('Bob'), \ # class, containing ... Person('Carol') ) # a collection of `Persons`
Challenges:
- The internal implementations of these data structures are different. HOW TO make sure that we can loop through all of them?
- New collection classes will have yet different implementation. HOW TO make sure that existing code can work with these new data types?
Python
is a high-level programming language, HOW TO make sure that there is a uniform and simple way to loop through these collections, without a lot of "boilerplate code".
Solution: The "Iterator Protocol", i.e.: a few simple rules that govern the "conversation" between components to get all elements of a collection.
Example: the code
for e in [1,2,3,4]: print(e)
involves the concerted "effort" of 3 different parties:- the
for
loop, - an (implicit) "Iterator", and
- the "Iterable" (of
list
)
- the
A general concept implemented in many modern programming languages. The most fundamental steps:
Each container data type (the Iterable, e.g.
list
[1,2,3,4]
) must implement its own "Iterator", which can retrieve all elements of the Iterable objectThe Iterator must be produced upon "request", i.e. in Python the return value of the
iter()
function:>>> myiterator = iter([1,2,3,4])
The component (e.g.
for
loop), which wants to access the elements of the Iterable ([1,2,3,4]
) will use the Iterator (myiterator
) to access the elements. InPython
the component will invoke thenext()
function to receive the next unseen element:>>> next_elem = next(myiterator)
The component will repeat the
next()
call, until the Iterator (myiterator
) signals that there is no element left. InPython
this signalling occurs by the iterator raising theStopIteration
exception:>>> next(myiterator) 2 # the next element >>> next(myiterator) 3 # the next element >>> next(myiterator) 4 # the next and last(!) element >>> next(myiterator) # nothing more left! Traceback (most recent call last): File "<stdin>", line 1, in <module> StopIteration
See also: Python 3 documentation on an Iterator
manually driving the loop, the programmer walks through the elements of the
str
Iterable using astr_iterator
Iterator:>>> s = 'Beautiful is better than ugly.' # Iterable: str >>> str_i = iter(s) # Iterator: str_i >>> type(str_i) <class 'str_iterator'> >>> next(str_i) # "Next element, please!" 'B' >>> next(str_i) 'e' >>> next(str_i) # ... repeat until done. 'a'
list
Iterable provides thelist_iterator
Iterator to loop through the content onenext()
call at a time:>>> l = [ 1, 2, 92, -32, 'python', 3.1415 ] >>> myiterator_l = iter(l) >>> type(myiterator_l) <class 'list_iterator'> # an iterator specific # for the `list` data type >>> next(myiterator_l) # same behavior 1
the
sum()
function drives an implicit loop to sum up thedict
object's keys using adict_keyiterator
Iterator:>>> sum({ 3: 'three', 1: 'one', 10: 'ten'}) 14 >>> iter({ 3: 'three', 1: 'one', 10: 'ten'}) <dict_keyiterator object at 0x7f69e460af48> >>> s = { 3, 3, 3, 4, -1, True, 94 } # set
the
min()
function drives an implicit loop to find the minimal value of theset
object using a :>>> s = set('Python') # convert a `str` to `set` >>> s {'t', 'n', 'o', 'h', 'P', 'y'} # can you guess why scrambled? >>> min(s) # give me the "smallest" element 'P' >>> type( iter( {'t', 'n', 'o', 'h', 'P', 'y'} ) ) <class 'set_iterator'>
in
Python
theiter()
function calls the Iterable's.__iter__()
method to get the Iterator object:>>> t = (1,2,3,4) >>> tuple_i = t.__iter__() >>> type(tuple_i) <class 'tuple_iterator'> >>> next(tuple_i) 1 >>> next(tuple_i) 2
the same Iterable can be looped through using multiple Iterators at the same time. Each Iterator stores its own position independently so they do not interfere with each other:
>>> s = 'Explicit is better than implicit.' # str Iterable >>> str_i1 = iter(s) # 1st Iterator >>> str_i2 = iter(s) # 2nd Iterator >>> str_i1, str_i2 (<str_iterator object at 0x7f69e460ce80>, <str_iterator object at 0x7f69e460cf28>) >>> next(str_i2) # 2nd Iterator first 'E' >>> next(str_i2) # str_i2 gives 2nd element 'x' >>> next(str_i1) # str_i1: 1st element 'E' >>> next(str_i2) # we switch arbitralily 'p' >>> next(str_i1) # the Iterators keep track 'x' # of their position >>> next(str_i1) 'p' >>> next(str_i1) 'l' >>> next(str_i2) 'l'
A Generator:
represents a collection of objects, which are (usually) not in the
Python
process' memory, but are generated with an expression or function.Example: the numerical sequence of the Fibonacci numbers (the "hello world" of generator examples ;-)
This is an collection of
int
objects, which can be generated using an expression:f_next = f_last + f_2ndlast
is an Iterator, i.e.: upon request of the
next()
it will produce the next element.
Generators can be created by either - a generator function, or - a generator expression
Generator function: any function with the yield
keyword in it:
>>> def fibonacci(n): # doctest: +ELLIPSIS ... a, b = 0, 1 ... i = 0 ... while i < n: ... yield b ... a, b = b, a+b ... i += 1 >>> g = fibonacci(12) # generator is create, but not started >>> type(g) <class 'generator'> >>> next(g) # produce the next value() 1 >>> g.__next__() # next() invokes the .__next__() 1 # magic method of the generator >>> next(g) 3 >>> list(fibonacci(12)) # force the generator to produce [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
A generator expression: similar syntax as any comprehension, but with
round braces: (
)
:
>>> l = [3, 4, 5, 6, 7, 11] # some data >>> g = ( e**2 for e in l ) # a generator expression >>> type(g) <class 'generator'> >>> next(g) # like any other Iterator... 9 >>> next(g) 16 >>> next(g) 25
A nice summary of the above concepts (inspired by this article)
- keyword
for
: can loop through the elements of any Iterable - operator
in
: checks if an object is an element of a collection
str
,list
,tuple
,bytes
(sequence types) etc...dict
,set
(mapping types)file
, i.e.: the return value ofopen()
- custom classes, which implement the
.__iter__()
method
sorted()
: sort an Interable, optionally with a specific sorting functionreversed()
: reverse the order of the elements of an Iterablemin()
,max()
: return the minimal or maximal value offilter()
map()
itemgetter()
:from operator import itemgetter firstelem = itemgetter(0) l = [ [a,b,c] for a in 'abc' for b in '12345' for c in 'ATGC' ] firstelems = map(firstelem; l)
range()
:enumerate()
:
- Challenge
Sort a list with the days of the week (list of strings) in the correct order.
>>> days = 'Mon Sun Tue Fri Sat Sun Mon Tue Mon Wed Sat Thu Fri'.split() >>> days ['Mon', 'Sun', 'Tue', 'Fri', 'Sat', 'Sun', 'Mon', 'Tue', 'Mon', 'Wed', 'Sat', 'Thu', 'Fri']
- Problem
By default
sorted()
function will sort strings in alphabetical order (lexicographical order).>>> sorted(days) ['Fri', 'Fri', 'Mon', 'Mon', 'Mon', 'Sat', 'Sat', 'Sun', 'Sun', 'Thu', 'Tue', 'Tue', 'Wed']
- Solution
Use a sort function which will define the order of the elements:
>>> def day_sorter(day): ... # the desired order of the elements ... order = 'Mon Tue Wed Thu Fri Sat Sun'.split() ... # return the position of the current element in the `order` list ... return order.index(day) ... >>> sorted(days, key=day_sorter) ['Mon', 'Mon', 'Mon', 'Tue', 'Tue', 'Wed', 'Thu', 'Fri', 'Fri', 'Sat', 'Sat', 'Sun', 'Sun']
- Bonus
The same sorting function will also work with
min()
andmax()
>>> min(days) Fri # No! >>> min(days, key=day_sorter) Mon # Yes!
- Challenge
How many males and females are in the following
dict
?persons = [ {'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}, {'name': 'Thomas', 'age': 44, 'gender': 'm'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Dragomir', 'age': 54, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}, ]
- Solution
Count the number of values of the
gender
attribute using thecollections.Counter
class. But:Counter
needs the to be counted values in a sequence-like Iterable, e.g.:>>> from collections import Counter >>> Counter('abracadabra') Counter({'a': 5, 'b': 2, 'r': 2, 'c': 1, 'd': 1})
However the data is in a
dict
! So let's extract the required data with a smalllambda
function:>>> gender_data_iterator = map(lambda v: v['gender'], persons) >>> Counter(gender_data_iterator) Counter({'m': 5, 'f': 4})
- Bonus
What if the data is not clean?
persons2 = [ {'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Mark', 'age': 17, 'gender': 'M'}, {'name': 'Thomas', 'age': 44, 'gender': 'M'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'M'}, {'name': 'Dragomir', 'age': 54, 'gender': 'M'}, {'name': 'Jenny', 'age': 34, 'gender': 'F'}, {'name': 'Eline', 'age': 29, 'gender': 'F'}, ]
In this case the previous solution clearly would give the wrong answer:
>>> Counter( map(lambda v: v['gender'], persons2) ) Counter({'M': 4, 'F': 2, 'f': 1, 'm': 1, 'v': 1})
So, let's lower-case the values before counting:
lambda v: v['gender'].lower()
:>>> gender_data_iterator2 = map(lambda v: v['gender'].lower(), persons2) >>> Counter(gender_data_iterator2) Counter({'m': 5, 'f': 4})
To see how this works, let's examine just the
lambda
function.- The raw data record: ::
>>> persons2[3] {'name': 'Thomas', 'age': 44, 'gender': 'M'}
When we apply the
lambda
function to the raw data:>>> f = lambda v: v['gender'].lower() >>> f(persons2[3]) 'm'
- Problem
- Given the
persons
(see)dict
of the previous example, sort the persons into age groups of decades, that is: 0-9, 10-19, 20-29, 30-39 etc... - Analysis
Let's define the desired output of our program as a
dict
, where:the keys: are the age buckets, expressed by a
range
object, e.g.:range(10)
. This corresponds with the ages of 0-9.values: are
list
, which contain the persons, who fall in the age bucket, e.g.:{ range(10, 20): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}], range(20, 30): [{'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}], range(30, 40): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}] }
- Solution
We'll need a
list
(ortuple
), which contains the differentrange
objects, against which the program will examine a persondict
, e.g.:categories = (range(9), range(10,20), range(20, 30), range(30, 40))
Also needed is an empty
dict
, which will store the result:res = {}
Finally, a nested loop will walk through the
persons
dict
and match the age of the currentperson
against eachrange
object:for r in categories: for p in persons: if p['age'] in r: res.setdefault(r, []).append(p)
The line code:res.setdefault(r, []).append(p) is perhaps the most intriguing here. Let's break this down:
.setdefault()
method will return one of the following values:- the value of the key
r
(i.e.: alist
), ifr
is an existing key inres
, OR - add the key
r
with an emptylist
as value tores
AND return this emptylist
object, ifr
was not yet a key
- the value of the key
- in either of the above cases, the expression
res.setdefault(r, [])
will return alist
, to which we append the current persondict
as a new element.
- Bonus
This algorithm will accept any arbitrary age buckets, even if they overlap. Observe the extended
categories
, where we added the age groups representing: elementary school children, high-school children, adults and retirees:categories = (range(9), range(10,20), range(20, 30), range(30, 40), range(6, 15), range(15, 19), range(19, 67), range(67, 120)) def categorize(persons, categories): res = {} for r in categories: for p in persons: if p['age'] in r: res.setdefault(r, []).append(p) return res print(categorize(persons, categories))
The result is:
{range(10, 20): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}, {'name': 'Mark', 'age': 17, 'gender': 'm'}], range(20, 30): [{'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}], range(30, 40): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}], range( 6, 15): [{'name': 'Lucy', 'age': 14, 'gender': 'f'}], range(15, 19): [{'name': 'Mark', 'age': 17, 'gender': 'm'}], range(19, 67): [{'name': 'Andrej', 'age': 34, 'gender': 'm'}, {'name': 'Thomas', 'age': 44, 'gender': 'm'}, {'name': 'Evi', 'age': 25, 'gender': 'f'}, {'name': 'Robert', 'age': 23, 'gender': 'm'}, {'name': 'Dragomir', 'age': 54, 'gender': 'm'}, {'name': 'Jenny', 'age': 34, 'gender': 'f'}, {'name': 'Eline', 'age': 29, 'gender': 'f'}] }
- Problem
- Create the
Addressbook
class, which is a collection ofPerson
instances. Make sure that theAddressbook
instances are Iterable. - Solution
Let's use the recently introduced
dataclasses
module to create the classes.from dataclasses import dataclass, field from typing import List @dataclass class Person: fname: str = '' sname: str = '' gender: str = '' email: str = '' @dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False)
At this point the
Addressbook
instances can hold items, but it is not yet an Iterable:>>> ab = Addressbook() >>> ab Addressbook(name='My Addressbook', _items=[]) >>> ab2._items += [ 'Jenny', 'Robert', 'Alice' ] >>> ab2 Addressbook(name='My Addressbook', _items=['Jenny', 'Robert', 'Alice'])
However it is not yet an Iterable:
>>> list(ab2) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'Addressbook' object is not iterable
In
Python
a class is only Iterable if it implements the.__iter__()
method, which provides an Iterator instance. So let's do it:@dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items)
It is now working:
>>> ab2 = Addressbook() >>> ab2._items += [ 'Jenny', 'Robert', 'Alice' ] >>> ab2 Addressbook(name='My Addressbook', _items=['Jenny', 'Robert', 'Alice']) >>> list(ab2) # convert Addressbook -> list ['Jenny', 'Robert', 'Alice']
While at it why don't we add a couple of other nice features, such as:
- implement the
.add()
method, which will add an item to the address book - implement the
.__len__()
method, so that thelen()
function is able to show the number of elements in the collection.
@dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items) def add(self, person): self._items.append(person) def __len__(self): return len(self._items)
Try out the result:
>>> fred = Person(fname='Fred', sname='Flintstone', gender='m', email='[email protected]') >>> wilma = Person(fname='Wilma', sname='Flintstone', gender='f', email='[email protected]') >>> ab = Addressbook(name='The Flintstones') >>> ab.add(fred) >>> ab.add(wilma) >>> print(f'Number of entries in addressbook: {len(ab)}') 2
- implement the
- Bonus
By implementing the
.__getitem__()
magic method on thePerson
class, we even can use the previous solution to count:@dataclass class Person: fname: str = '' sname: str = '' gender: str = '' email: str = '' def __getitem__(self, item): res = getattr(self, item) return res @dataclass class Addressbook: name: str = 'My Addressbook' _items: List[Person] = field(default_factory=list, init=False) def __iter__(self): return iter(self._items) def add(self, person): self._items.append(person) def __len__(self): return len(self._items) fred = Person(fname='Fred', sname='Flintstone', gender='m', email='[email protected]') wilma = Person(fname='Wilma', sname='Flintstone', gender='f', email='[email protected]') ab = Addressbook(name='The Flintstones') ab.add(fred) ab.add(wilma)
Finally let's try how our new classes fit in our data-processing toolkit so far :
gender_data_iterator = map(lambda v: v['gender'], ab) >>> Counter(gender_data_iterator) Counter({'m': 1, 'f': 1})