title | author | date | licence |
---|---|---|---|
Solving practical problems with Python - Part 2 | Gábor Nyers <[email protected]> | 2022-09-18 | CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/ |
The `profiler` program from the previous session performs its main function but lacks almost all of the software engineering practices that make a program robust enough for general use. In this session we will implement a number of these necessary features, such as the ability to control the program's behavior with CLI arguments and to provide logging.
We will also solve the challenge described at the end of the previous session.
Audience: This session, like the previous one, is meant for the novice Python programmer. You may code along or just focus on how to approach the problem and build a solution.
For more information on its features see the previous session.
- Goal: recursively analyze the content of a directory
- Print information about:
  - the number of filesystem objects, e.g.: files, directories, symbolic and hard links; and
  - the disk space used by all the above elements

  See Example output
- The source code of `profiler` lists a few shortcomings of the implementation:

  **Challenge:** The method used to calculate the `total_disk_usage` does not take hardlinks into account. (1) Why? (2) Based on the `distict_files` dictionary, you could correct the value of `total_disk_usage` to calculate the **exact** disk usage. How?
$ ./profiler exampledir
--- Statistics of directory: exampledir
Number of errors encountered while processing 0
The number of dirs: 6
The number of non-dirs (i.e.: files, hard- and symlinks): 8
The number of symlinks: 1
The number of hardlinks: 3
Used disk space: 524 bytes
Which of the hardlinks point to the same file?
exampledir/file1.dat, exampledir/a/file2.bin, exampledir/c/e/file15 : 7 bytes
The content of the `exampledir` directory is the following:
$ tree -F exampledir/
exampledir/
├── a/
│   ├── b/
│   │   ├── d/                              #    + symbolic link
│   │   │   ├── file8                       #    | (ie.: special file
│   │   │   └── file9 -> ../../../c/file13  # <--+ pointing to a path)
│   │   └── file3                           #
│   ├── file2.bin                           # <--+ hard links
│   └── file5                               #    | (ie.: different names
├── c/                                      #    | point to the same
│   ├── e/                                  #    | content, which is
│   │   └── file15                          # <--+ stored only once
│   └── file13                              #    | on disk)
└── file1.dat                               # <--+
Currently the total disk usage is being calculated with this list-comprehension expression:
total_disk_usage = sum([ attrs.get('disk_usage',0)
                         for path,attrs in data.items()
                         if not attrs['isdir']    # count files and links, not directories
                       ])
Why is this expression incorrect?
In the calculation of the total disk usage, the above expression disregards the fact that hard links are just names pointing to the same inode (i.e.: file content).
Instead of counting the size only once, the current expression adds the file's size to the total for every hardlink, i.e.: `file2.bin`, `file15` and `file1.dat`.
NOTE: See profiler.1 for the implemented solution!
- We need an auxiliary data structure that contains the inodes of all non-directory filesystem objects.

  The auxiliary data structure could be either:
  - a `set`, e.g.:

        deduped_non_dir_inodes = { attrs['inode'] for path, attrs in data.items() if not attrs['isdir'] }
        # deduped_non_dir_inodes:
        # {217065955, 40485989, 40485993, 298461643, 40485982, 40485983}

  - or a `dict`, e.g.:

        deduped_non_dir_inodes = { attrs['inode']:path for path, attrs in data.items() if not attrs['isdir'] }
        # deduped_non_dir_inodes:
        # { 40485993: '../session09/exampledir/c/e/file15',
        #   40485989: '../session09/exampledir/a/file5',
        #   217065955: '../session09/exampledir/a/b/file3',
        #   40485982: '../session09/exampledir/a/b/d/file9',
        #   40485983: '../session09/exampledir/a/b/d/file8',
        #   298461643: '../session09/exampledir/c/file13' }

  The `dict` option is more suitable for this case, because we already have the dictionary `data`, whose keys are the filesystem objects' `path`. By combining the two, we can easily retrieve the `size` of each unique non-directory `inode`:

        >>> deduped_non_dir_inodes[217065955]
        '../session09/exampledir/a/b/file3'
        >>> data['../session09/exampledir/a/b/file3']['size']
        153
- The calculation should only take into account the sizes of those inodes mentioned in the auxiliary data structure.

  Modify the expression that calculates the total disk usage:

        deduped_non_dir_inodes = {attrs['inode']: path
                                  for path, attrs in data.items()
                                  if not attrs['isdir'] }
        total_disk_usage_corrected = sum([ data[path]['size']
                                           for inode,path in deduped_non_dir_inodes.items()])

  NOTE: the calculation now loops through the auxiliary data structure `deduped_non_dir_inodes`.
NOTE: See profiler.1 for the implemented solution!
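The effect of the correction can be tried out on a small scratch directory. The following is only a minimal sketch, not the code of `profiler.1`: the helper `scan()` and the file names are assumptions made for illustration, and it assumes a filesystem that supports hard links. It builds a `data` dictionary with the same keys used above (`inode`, `size`, `refcount`, `isdir`) and compares the naive total with the deduplicated one.

```python
import os
import stat
import tempfile


def scan(root):
    """Collect a few lstat() attributes for every object below root (hypothetical helper)."""
    data = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat() does not follow symbolic links
            data[path] = {
                "inode": st.st_ino,
                "size": st.st_size,
                "refcount": st.st_nlink,
                "isdir": stat.S_ISDIR(st.st_mode),
            }
    return data


with tempfile.TemporaryDirectory() as root:
    first = os.path.join(root, "file1.dat")
    with open(first, "wb") as fh:
        fh.write(b"7 bytes")  # exactly 7 bytes of content
    os.link(first, os.path.join(root, "file2.bin"))  # first extra name for the same inode
    os.link(first, os.path.join(root, "file15"))     # second extra name for the same inode
    with open(os.path.join(root, "file3"), "wb") as fh:
        fh.write(b"x" * 153)
    data = scan(root)

# Naive total: every name is counted, so the 7 bytes are added three times.
naive_total = sum(a["size"] for a in data.values() if not a["isdir"])

# Corrected total: one path per inode, so every inode is counted exactly once.
deduped_non_dir_inodes = {a["inode"]: p for p, a in data.items() if not a["isdir"]}
corrected_total = sum(data[p]["size"] for p in deduped_non_dir_inodes.values())

print(naive_total, corrected_total)  # 174 160
```

Note that inode numbers are only unique within a single filesystem, so a tree that spans several mount points would need the device number (`st_dev`) as part of the key.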
It could be argued that the original expression needs only to be slightly modified and extended with a simple check `attrs['refcount'] < 2`, e.g.:
# Incorrect solution:
total_disk_usage = sum([ attrs['size']
for path,attrs in data.items()
if not attrs['isdir']
and attrs['refcount'] < 2 ]) # <-- check of refcount
Why is this incorrect:
- Most importantly: the above expression disregards all hard links and does not count the size of any of them! For the correct calculation of the total disk usage, the size of every inode needs to be added exactly once!
- Also: there could be multiple different hard links in the directory structure. In this case `attrs['refcount']` only indicates that this filesystem object is a hard link. It does not tell which inode is being referenced by the current element.
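The difference between the three approaches can be checked with a few lines of arithmetic. The sketch below uses a hand-written `data` dictionary (all paths, inode numbers and sizes are made up for illustration) that contains two independent groups of hard links and one ordinary file.

```python
from collections import defaultdict

# Hypothetical sample data: two hard-link groups (inodes 10 and 20) and one plain file.
data = {
    "d":        {"inode": 1,  "size": 4096, "refcount": 2, "isdir": True},
    "d/a1":     {"inode": 10, "size": 7,    "refcount": 3, "isdir": False},
    "d/a2":     {"inode": 10, "size": 7,    "refcount": 3, "isdir": False},
    "d/a3":     {"inode": 10, "size": 7,    "refcount": 3, "isdir": False},
    "d/b1":     {"inode": 20, "size": 100,  "refcount": 2, "isdir": False},
    "d/b2":     {"inode": 20, "size": 100,  "refcount": 2, "isdir": False},
    "d/single": {"inode": 30, "size": 50,   "refcount": 1, "isdir": False},
}

# 1. Naive: every name is counted, so hard-linked content is counted repeatedly.
naive = sum(a["size"] for a in data.values() if not a["isdir"])

# 2. The refcount check: hard-linked content is not counted at all.
refcount_only = sum(a["size"] for a in data.values()
                    if not a["isdir"] and a["refcount"] < 2)

# 3. Grouping by inode: every inode is counted exactly once, and the groups
#    also tell which names point to the same content.
paths_by_inode = defaultdict(list)
for path, a in data.items():
    if not a["isdir"]:
        paths_by_inode[a["inode"]].append(path)
exact = sum(data[paths[0]]["size"] for paths in paths_by_inode.values())

print(naive, refcount_only, exact)  # 271 50 157
```

Only the third variant yields the real usage (7 + 100 + 50 bytes); the grouping additionally shows that `refcount` alone cannot tell the two hard-link groups apart.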
Since the session is meant for the novice programmer, let's recap the purpose and usage of command-line (CLI) arguments.
- Primary purpose of CLI arguments: From the early days of operating systems, CLI arguments have been used to control a program's behavior by enumerating the required parameters on the command line.
- These arguments need to be interpreted (i.e.: "parsed") by the program immediately after start in order for it to configure its run-time parameters.
# <-(1)--> <--------------(2)--------------->
#          <---------(3)--------->
#          <--(a)--> <----(b)---->
#                    <-(i)--> (ii)
#                                  <--(4)--->
  profiler --verbose --format=json exampledir
#      ->(5)<-   ->(5)<-       ->(5)<-
#
Where:
1. the name of the command being executed
2. all the arguments, separated by spaces, that will be passed by the OS to the program (1); in Python the CLI arguments are available in `sys.argv` (of type `list`), where `sys.argv[0]` holds the name of the program itself
3. the CLI options for the program
   - 3.a) the GNU-style long-named option `--verbose`, which represents a boolean, recognizable by the double dash prefix and the lack of a value
   - 3.b) the long-named option with a value: `--format=json`, representing a key/value pair:
     - i. the name of the option: `--format`, and
     - ii. its value: `json`
4. a positional argument, usually representing the main input for the program
5. at least a single space character to separate the elements of the command
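To see how the example command line arrives in a Python program, the splitting done by the shell can be imitated with `shlex.split()`. This is only an illustration of the anatomy described above: the classification by prefix is deliberately naive and is not how a real parser works.

```python
import shlex

command = "profiler --verbose --format=json exampledir"

# The shell splits the line on whitespace; the result is what ends up in sys.argv.
argv = shlex.split(command)
program, arguments = argv[0], argv[1:]   # (1) the command, (2) all its arguments

# Naive classification, for illustration only.
options = [arg for arg in arguments if arg.startswith("--")]         # (3)
positionals = [arg for arg in arguments if not arg.startswith("-")]  # (4)

# (3.b) a long option with a value is a key/value pair separated by "=".
name, _, value = options[1].partition("=")

print(argv)         # ['profiler', '--verbose', '--format=json', 'exampledir']
print(name, value)  # --format json
```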
- "Unix-style" options: `-s -vvv -l prog.log`, or its equivalents: `-v 3 -a`, `-l prog.log -dv3`
- "GNU-style" or "long-named" options: `--log prog.log --verbose=3 --all`
- "Windows-style" arguments: `/a /l prog.log /v`
A few examples to illustrate the complexity of CLI argument parsing:
- the options and arguments may be specified in any order: `-a -v` or `-v -a`
- each option may be given separately (`-a -v`) or in compacted form (`-av`)
- options may have an optional value, e.g.: `--debug 3` (increase debug level to 3, instead of 1)
- the same run-time parameter may have a "short" (`-l prog.log`) and a "long" (`--log prog.log`) name; with or without the `=` sign: `--log=prog`
- dealing with corner cases, e.g.:
  - distinction between an option's value and a positional argument, e.g.: `... --output json ...` (is `json` a value to the option `--output` or is it a positional argument?)
  - passing a special character as a value to an option, e.g.: `cut -d "-"`
  - deleting a file with an unusual name, e.g.: `--somefile`; this is typically solved by the introduction of a "sentinel" token, e.g.: `rm -- --somefile`, where `--` indicates that after it there are no more options, only positional arguments.
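Several of the cases listed above can be tried out with the `argparse` module from the Standard Library. The following is a minimal sketch: the option names mirror the examples in the list, while the program name `demo` and the positional argument `name` are assumptions made for illustration.

```python
import argparse

parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("-a", "--all", action="store_true")
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("-l", "--log")
# Option with an optional value: 0 when absent, 1 when given without a value.
parser.add_argument("--debug", nargs="?", const=1, default=0, type=int)
parser.add_argument("name")

# Any order, separate or compacted: all three forms parse to the same result.
separate = parser.parse_args(["-a", "-v", "target"])
swapped = parser.parse_args(["-v", "-a", "target"])
compact = parser.parse_args(["-av", "target"])

# Short and long name, with or without the "=" sign.
short = parser.parse_args(["-l", "prog.log", "target"])
long_space = parser.parse_args(["--log", "prog.log", "target"])
long_equals = parser.parse_args(["--log=prog.log", "target"])

# Optional value of an option.
debug_default = parser.parse_args(["target", "--debug"])
debug_three = parser.parse_args(["target", "--debug", "3"])

# The "--" sentinel: everything after it is a positional argument.
sentinel = parser.parse_args(["--", "--somefile"])

print(separate == swapped == compact)          # True
print(debug_default.debug, debug_three.debug)  # 1 3
print(sentinel.name)                           # --somefile
```

The ambiguity mentioned in the list is real here as well: with `--debug target` the parser would try to use `target` as the value of `--debug`, which is why the option is placed last in this sketch.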
- Because of the complexity outlined above, it is rarely a good idea to spend any effort writing your own CLI parser. Python already provides a wealth of readily available and high-quality libraries that will fit your needs.

  An additional benefit is that most of these libraries will not only parse the CLI arguments, but will also perform argument validation.

  A few examples of validation:
  - `--log=prog.file`: is the file `prog.file` writable?
  - `--age=32`: only accept integers in the range 0-120 as option value
  - `--fqdn=www.example.com`: does the value match the notation of a "fully qualified domain name"?
  - `--color=red`: verify that the value is one of a limited number of choices, e.g.: red, green or blue
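Two of these validations can be sketched with `argparse`: a custom `type` function for the range check and the `choices` parameter for the limited set of values. The program name `validate-demo` and the helper functions `age()` and `is_rejected()` are assumptions made for this illustration.

```python
import argparse


def age(text):
    """Convert the option value to an int and accept only the range 0-120."""
    value = int(text)  # a ValueError here is reported by argparse as an invalid value
    if not 0 <= value <= 120:
        raise argparse.ArgumentTypeError(f"{value} is not in the range 0-120")
    return value


parser = argparse.ArgumentParser(prog="validate-demo")
parser.add_argument("--age", type=age)
parser.add_argument("--color", choices=["red", "green", "blue"])


def is_rejected(argv):
    """Return True if argparse refuses the arguments (it prints a message and exits)."""
    try:
        parser.parse_args(argv)
    except SystemExit:
        return True
    return False


accepted = parser.parse_args(["--age=32", "--color=red"])
print(accepted.age, accepted.color)   # 32 red
print(is_rejected(["--age=150"]))     # True
print(is_rejected(["--age=abc"]))     # True
print(is_rejected(["--color=pink"]))  # True
```

On invalid input `argparse` writes a usage line and an error message to stderr and terminates the program, which is usually the desired behavior for a CLI tool.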
A few useful Python libraries for CLI argument parsing:
- `argparse`: a modern implementation that is part of the Python Standard Library. For additional examples see also the Argparse tutorial.
- `getopt`: the legacy argument parser of the Standard Library, very similar to the standard argument parsing function `getopt` provided by both the GlibC library and the `bash` shell.
- `docopt`: an excellent 3rd-party library that allows the declaration of the CLI parsing features just by describing them using the documentation notation rules. (see this example and demo)
- "Command Line Interface Creation Kit" or `click`: a fully featured 3rd-party library for developing sophisticated command-line oriented applications.
- [require a single mandatory positional argument called `DIR`, that is interpreted as a directory's path to be analyzed]{ #req-1 }
- [verify that the path in `DIR` exists and is a directory]{ #req-2 }
- [provide a description of the above options and arguments if the program is invoked with the `--help` or `-h` options.]{ #req-3 }
The following example demonstrates the usage of the `argparse` module.
{% include("argparse-demo.py") %}
See also: argparse-demo.py
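The file `argparse-demo.py` itself is not reproduced in this text. As a rough idea of what such a demo might look like, here is a minimal sketch that was reconstructed only from the requirements and from the messages shown in this section; it is not the author's file, and the helper names `directory()` and `build_parser()` are assumptions.

```python
import argparse
import os


def directory(path):
    """Accept the value only if it is the path of an existing directory."""
    if not os.path.isdir(path):
        raise argparse.ArgumentTypeError(f"{path} is not a directory")
    return path


def build_parser():
    parser = argparse.ArgumentParser(
        description="Simple argparse demo for the `profiler` program",
        epilog="That's all folks!",
    )
    # A positional argument is mandatory by default; -h/--help is added automatically.
    parser.add_argument("DIR", type=directory, help="path to directory to analyze")
    return parser


# In a real script this would be: args = build_parser().parse_args()
args = build_parser().parse_args([os.curdir])
print(args.DIR)
```

`os.path.isdir()` returns `False` both for a non-existing path and for an existing path that is not a directory, so one check covers both parts of the second requirement.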
- single positional argument `DIR`, must point to a directory:

  Executing the program without any arguments will not work:

      $ ./argparse-demo.py
      usage: argparse-demo.py [-h] DIR
      argparse-demo.py: error: the following arguments are required: DIR

- `DIR` MUST:

  exist:

      $ ./argparse-demo.py /path/to/non-existing/dir
      usage: argparse-demo.py [-h] DIR
      argparse-demo.py: error: argument DIR: /path/to/non-existing/dir is not a directory

  be a directory:

      $ ./argparse-demo.py /etc/hosts
      usage: argparse-demo.py [-h] DIR
      argparse-demo.py: error: argument DIR: /etc/hosts is not a directory

- Program description:

      $ ./argparse-demo.py --help
      usage: argparse-demo.py [-h] DIR

      Simple argparse demo for the `profiler` program

      positional arguments:
        DIR         path to directory to analyze

      optional arguments:
        -h, --help  show this help message and exit

      That's all folks!
NOTE: See profiler.2 for the implemented solution!
Demo 1: wrong arguments
$ ./profiler.2 /path/to/non-existent/dir
usage: profiler.2 [-h] DIR
profiler.2: error: argument DIR: /path/to/non-existent/dir is not a directory
Demo 2: help function
$ ./profiler.2 -h
usage: profiler.2 [-h] DIR
Recursively analyze a directory structure and provide some stats about the
disk space usage of the entire content, the number of files, directories, hard-
en symlinks.
Usage:
./profiler exampledir
Example output of a simple directory structure:
--- Statistics of directory: exampledir/
Number of errors encountered while processing 0
The number of dirs: 6
The number of non-dirs (i.e.: files, hard- and symlinks): 8
The number of symlinks: 1
The number of hardlinks: 3
Used disk space: 524 bytes
Which of the hardlinks point to the same file?
exampledir/file1.dat, exampledir/a/file2.bin, exampledir/c/e/file15 : 7 bytes
positional arguments:
DIR path to directory to analyze
optional arguments:
-h, --help show this help message and exit
That's all folks!
NOTE: See profiler.2 for the implemented solution!
Purpose of logging:
To make it possible to follow the workings of a program or to identify malfunctions, the program needs to emit information about its internal state.
NOTE:
- While sometimes used as a synonym, debugging means something different: it involves some external tool (i.e.: a debugger), that will be actively used by (usually) a developer to extract ad-hoc information from the application in real time.
- Logging, on the other hand, is done by the application itself during run-time. It emits predetermined messages at predetermined points of its flow.
- To be able to control the type and amount of log messages, some additional logic needs to be implemented by the program.
A few common ways of logging:
- using the `print()` function (or its variant: `pprint()`, "p" as in "pretty-print")
- the `logging` module in the Python Standard Library

  The `logging` module provides numerous sophisticated features, e.g.:
  - multiple logging sources, each with its own configurable threshold to suppress or allow messages and its own (set of) handlers
  - customizable handlers which deliver log messages, e.g. to the console, a file, a local or remote syslog server, etc.
  - sophisticated filters to allow or suppress messages based on customizable criteria (source, handler, time of day, etc.)
  - customizable formatters to control how a log message is presented (see also: log attributes)
- The built-in `logging` module is feature-rich, robust and well-tested. `logging` has borrowed many ideas from the log4j product.
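A few of these features can be seen together in a short sketch: one named logging source, two handlers with their own thresholds, and a shared formatter. The logger name `profiler.demo` is made up for illustration, and two in-memory buffers stand in for the terminal and a log file so that the result can be inspected.

```python
import io
import logging

console = io.StringIO()  # stands in for the terminal
archive = io.StringIO()  # stands in for a log file

logger = logging.getLogger("profiler.demo")  # hypothetical logging source
logger.setLevel(logging.DEBUG)               # the source itself lets everything through
logger.propagate = False                     # do not hand messages to the root logger

formatter = logging.Formatter("%(levelname)s %(name)s: %(message)s")

console_handler = logging.StreamHandler(console)
console_handler.setLevel(logging.WARNING)    # only warnings and worse on the "terminal"
console_handler.setFormatter(formatter)

archive_handler = logging.StreamHandler(archive)
archive_handler.setLevel(logging.DEBUG)      # everything goes to the "file"
archive_handler.setFormatter(formatter)

logger.addHandler(console_handler)
logger.addHandler(archive_handler)

logger.debug("scanning started")
logger.warning("cannot read directory")

print(console.getvalue(), end="")  # one line: the WARNING message
print(archive.getvalue(), end="")  # two lines: the DEBUG and the WARNING message
```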
NOTE: See profiler.3 for the implemented solution!
1. Provide the following log messages:
   1. In case of an exception during the analysis of the directory structure, emit an `error` message (meaning: the level error as defined in `logging`).
   2. Emit an `info` message when entering a directory to be processed. The message should mention the number of objects that have been analyzed so far.
   3. Each message should record the current timestamp (formatted in the ISO notation, e.g.: "2022-09-27T19:00:00"), the error level and the message.
2. The program should be able to log to the terminal (default) or a file.
3. Add the following CLI option to the program: `-l` or `--loglev`, which configures the threshold that determines when to suppress or emit a message.

NOTE: the `logging` module has 5 different predefined levels to distinguish between the urgency of a message: `DEBUG`, `INFO`, `WARNING`, `ERROR` and `CRITICAL`
A basic logging example:
{% include("logging-demo.py") %}
See also: logging-demo.py
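The file `logging-demo.py` is not reproduced in this text. The sketch below is not the author's file; it only shows one way to obtain messages in the shape of the example output in this section. The format string, the helper `setup_logging()` and its parameters are assumptions, and the `force` argument of `basicConfig()` requires Python 3.8 or later. An in-memory buffer is used instead of the terminal so that the result can be inspected.

```python
import io
import logging


def setup_logging(loglev="warning", stream=None):
    """Configure the root logger; stream=None means the default, stderr."""
    logging.basicConfig(
        level=getattr(logging, loglev.upper()),  # e.g. "warning" -> logging.WARNING
        format="%(asctime)s %(levelname)s (%(module)s:%(lineno)d) %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S%Z",           # ISO-like timestamp plus time zone name
        stream=stream,
        force=True,                              # replace any earlier configuration
    )


buffer = io.StringIO()
setup_logging("warning", stream=buffer)

logging.debug("This is a DEBUG level message")
logging.info("This is an INFO level message")
logging.warning("This is a WARNING level message")
logging.error("This is an ERROR level message")
logging.critical("This is a CRITICAL level message")

# Only the three messages at or above the threshold were emitted.
print(buffer.getvalue(), end="")
```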
Example output:
$ ./logging-demo.py
2022-09-27T02:04:38CEST WARNING (logging-demo:25) This is an WARNING level message
2022-09-27T02:04:38CEST ERROR (logging-demo:26) This is an ERROR level message
2022-09-27T02:04:38CEST CRITICAL (logging-demo:27) This is a CRITICAL level message
NOTE:
- the `DEBUG` and `INFO` messages have been suppressed because of the `loglev` value provided to `basicConfig()`
- the timestamp is displayed in the format as stated in requirement 1.3
NOTE: See profiler.3 for the implemented solution!
- support the CLI options `--loglev` and `--logfile`:

      $ ./profiler.3 --help
      ...
      positional arguments:
        DIR                   path to directory to analyze

      optional arguments:
        -h, --help            show this help message and exit
        --loglev {debug,info,warning,error,critical}
                              required log level (default: warning)
        --logfile LOGFILE     output the logs to this file (default: stderr)

      That's all folks!

- Provide info messages when entering a directory and change the amount of emitted information with the `--loglev` option:

      $ ./profiler.3 --loglev=info ../session09/exampledir/
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/a
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/a/b
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/a/b/d
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/c
      2022-09-27T19:48:53CEST INFO (profiler:127) Entering directory: ../session09/exampledir/c/e
      --- Statistics of directory: ../session09/exampledir/
      Number of errors encountered while processing 0
      ...

- Log to the terminal or a file:

      $ ./profiler.3 --loglev=info --logfile=profiler.3.log ../session09/exampledir/
      --- Statistics of directory: ../session09/exampledir/
      Number of errors encountered while processing 0
      The number of dirs: 6
      The number of non-dirs (i.e.: files, hard- and symlinks): 8
      The number of symlinks: 1
      The number of hardlinks: 3
      Used disk space: 524 bytes
      Used disk space (corrected): 510 bytes
      Which of the hardlinks point to the same file?
      ../session09/exampledir/file1.dat, ../session09/exampledir/a/file2.bin, ../session09/exampledir/c/e/file15 : 7 bytes

  NOTE:
  - The command is the same as above plus the `--logfile=profiler.3.log` option. Notice that there are no log messages (e.g.: "... INFO ...") on the terminal.
  - The file `profiler.3.log` does contain those log messages:

        $ head -3 profiler.3.log
        2022-09-27T19:51:04CEST INFO (profiler:127) Entering directory: ../session09/exampledir/
        2022-09-27T19:51:04CEST INFO (profiler:127) Entering directory: ../session09/exampledir/a
        2022-09-27T19:51:04CEST INFO (profiler:127) Entering directory: ../session09/exampledir/a/b
NOTE: See profiler.3 for the implemented solution!
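How the CLI options and the logging configuration might be wired together is sketched below. This is not the code of `profiler.3`: the function names, the wording of the messages and the way objects are counted are assumptions, and only the option names and help texts are taken from the `--help` output shown above. The sketch logs an `info` message for every directory it enters and an `error` message when `os.walk()` reports a problem.

```python
import argparse
import logging
import os
import tempfile

LEVELS = ["debug", "info", "warning", "error", "critical"]


def parse_cli(argv=None):
    parser = argparse.ArgumentParser(description="logging options sketch")
    parser.add_argument("DIR", help="path to directory to analyze")
    parser.add_argument("--loglev", choices=LEVELS, default="warning",
                        help="required log level (default: warning)")
    parser.add_argument("--logfile", default=None,
                        help="output the logs to this file (default: stderr)")
    return parser.parse_args(argv)


def setup_logging(loglev, logfile):
    logging.basicConfig(
        level=getattr(logging, loglev.upper()),
        filename=logfile,  # None: messages go to stderr
        format="%(asctime)s %(levelname)s (%(module)s:%(lineno)d) %(message)s",
        datefmt="%Y-%m-%dT%H:%M:%S%Z",
        force=True,        # Python 3.8+: replace any earlier configuration
    )


def walk(root):
    """Visit all directories below root and return the number of objects seen."""
    count = 0

    def on_error(exc):  # called by os.walk() with the OSError it ran into
        logging.error("Cannot process %s: %s", exc.filename, exc)

    for dirpath, dirnames, filenames in os.walk(root, onerror=on_error):
        logging.info("Entering directory: %s (%d objects analyzed so far)",
                     dirpath, count)
        count += len(dirnames) + len(filenames)
    return count


def main(argv=None):
    args = parse_cli(argv)
    setup_logging(args.loglev, args.logfile)
    return walk(args.DIR)


# Small demonstration on a scratch directory: exampledir/a/b/file8
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "exampledir")
    os.makedirs(os.path.join(target, "a", "b"))
    with open(os.path.join(target, "a", "b", "file8"), "w") as fh:
        fh.write("content")
    logfile = os.path.join(scratch, "demo.log")

    total = main([target, "--loglev=info", f"--logfile={logfile}"])
    with open(logfile) as fh:
        log_text = fh.read()
    logging.shutdown()  # close the log file before the scratch directory is removed

print(total)            # 3
print(log_text, end="")
```

Leaving out `--logfile` sends the same messages to stderr, which keeps them separate from the statistics printed on stdout.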