Skip to content

Latest commit

 

History

History

session02

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Python Tuesday: Session 2

Working with text files and file formats

Date: 2019-08-27
Author: Gábor Nyers
tags:python
category:python_workshop
summary:A demonstration of reading and writing information to files in different formats
licence:CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/

Agenda

Prep

Before being able to access any file Python needs to open it.

  1. The absolute minimum:

    (Example as executed in an interactive Python session or Python REPL)

    >>> # open the file 'names.txt' for reading, assume a text file
    >>> fh = open('names.txt')
    >>>

    What this means:

    • fh: will be the variable pointing to a new file handler object which is the result of the open() function
    • open(): function will create a new file handler object to the names.txt file
    • (allow read-only access to the file, because of missing second optional mode argument)
  2. Open a file for writing only, for example to write some data to a new file:

    >>> # open the file 'names.txt' for writing
    >>> fh = open('names.txt', 'wt')
    >>>

    The above instruction will perform the same as above, except for the following:

    • "wt": an optional argument for the open() function called the mode string. This is a series of letter codes each with its specific meaning as to how Python should request access to the file.

      • "w" request write-only access, create the file or truncate its content, if it already exists
      • "t" interpret the file's content as Unicode text

      See also: [pydoc_open]) for other letters and their meaning

  3. The mode string governs how the file is accessed:

    Depending on the use-case we may want to access a file in many different ways:

    1. Load a configuration file: read-only, read once, content as text
    2. Export a CSV file: write-only, write once, content as text
    3. Read a picture from a PNG file: read-only, read once, binary content
    4. Append a new record to a log file: write-only, append to the end of a file (without overwriting the rest!), content as text
    5. Modify 3 records in a big database file: read and write multiple times, seek to a different positions within the file, binary content
  1. The dangers of the "w" or "w+" mode:

    The letter codes "w" and "w+" in the open() functions mode string will both instruct Python to truncate an existing file. That is: all existing content will be lost and may be only be recovered from an existing backup.

  2. Depending on your use case it may be safer to use the "x" (or its variant "x+") letter code instead. In this case, if the given file exists, Python will throw an exception:

    1 >>> fh = open('names.txt', 'xt')
    2 Traceback (most recent call last):
    3   File "<stdin>", line 1, in <module>
    4 FileExistsError: [Errno 17] File exists: 'names.txt'
    5 >>>

    This mode is the only safe way to handle files

Suppose we have the following list of names and want to write them to a file one name per line:

Hayley Peter Chris Stan Brian Lois Marge Stewie Francine Meg
1 #!/usr/bin/env python3
2 names = 'Hayley Peter Barney Stan Brian Lois Marge Stewie Francine Wilma'
3 
4 fh = open('names.txt', 'wt')        # create empty file with name "names.txt"
5 names_l = names.split()             # split long ``str`` into a ``list``
6 for name in names_l:                # loop through the list of names
7     fh.write(name + '\n')           # write current name + '\n' (new line)
8 fh.close()                          # close file

So, what has happened here:

  • line 1: special "Shebang" line, instructing the OS what interpreter to execute this file with

  • line 2: create a new str object containing the names and bind the names variable to it

  • line 4: (re-)create new empty file with the name "names.txt" (Remember: an existing file's data will be deleted!)

  • line 5: split-up the long str object into multiple shorter str and gather them into a new list object. Because the .split() method didn't received an argument, by default the splitting will occur at the any of the following characters: ' ' (space), 't' (tab) and 'n' (new line)

  • line 6: loop through the elements of the names_l list object one element at a time. In each iteration the current element is bound to the name variable

  • line 7: in each iteration write a str containing the current name and a "\n" (new line) character to the file represented by the fh file handler object; in our case the names.txt file.

    This line will be executed for each element of the names_l list, i.e.: 10 times.

  • line 8: close the file

Now that we have created the names.txt file let's read the data. We have more than one way to do this:

  1. The simplest way to read the content of the file is to read the whole content into memory, such as:

    >>> fh = open('names.txt')
    >>> content = fh.read()
    >>> type(content)
    <class 'str'>
    >>> content
    'Hayley\nPeter\nBarney\nStan\nBrian\nLois\nMarge\nStewie\nFrancine\nWilma\n'

    The file's content is now in a str object, which when printed produces to following output:

    >>> print(content)
    Hayley
    Peter
    Barney
    Stan
    Brian
    Lois
    Marge
    Stewie
    Francine
    Wilma
    >>> fh.close()
  2. More often than not we want to read text files line-by-line:

    >>> fh = open('names.txt')
    >>> for line in fh:
    ...     print(line)              # doctest: +ELLIPSIS
    Hayley
    
    Peter
    
    Barney
    
    Stan

    Please note the double spaced output! This is the consequence of the default behaviors. On the one hand when reading a line, the file handler leaves the "\n" (new line) character intact at the end of the line. Verify this by typing the line variable, which still contains the last line:

    >>> line
    'Wilma\n'
    >>> fh.close()                   # close the file

    Additionally, the print() function automatically prints a "\n" character, resulting in double spaced printout.

    The following is a solution, where the end='' argument instructs the print() function to print an empty str at the end of the line:

    >>> fh = open('names.txt')
    >>> for line in fh:
    ...     print(line, end='')
    ...
    Hayley
    Peter
    Barney
    Stan
    Brian
    Lois
    Marge
    Stewie
    Francine
    Wilma

In addition to (or in place of) the above interactive commands, we can collect these instructions into a Python program file:

1 #!/usr/bin/env python3
2 
3 fh = open('names.txt')
4 for line in fh:
5     print(line, end='')

The program may be executed directly from your IDE or by entering the following in a terminal:

python3 read-names-from-txt.py

The encoding of text files becomes a concern once we want to read and write text files with non-ASCII characters, i.e.: letters and symbols which are not used in the English writing system. (see [ASCII1967]). Here are a few examples:

  • international characters: Français, Español, Português, Plattdüütsch, ελληνική, Русский, שפה עברית, հայոց լեզու, 普通話
  • emoticons: ☺(grinning face) ☹(sad face)
  • and symbols: ❄(snowflake) ✌(V sign) €(euro sign) ⚕(medicine) ☮(peace sign)

The multitude (dozens!) of character pages and encoding standards used to make working with -- and especially exchanging -- textual data outside English speaking countries a daunting challenge. The solution has been gradually implemented came about in de first decade of the '00s with the widespread adoption of the Unicode standard (see [Unicode2001]).

In Python 3 the built-in str datatype (and a few others) has been re-implemented and strings are now 100% handled as [Unicode2001]. Data interchange -- i.e.: reading and writing text data -- now typically works to a degree that users hardly notice it's there.

Important takeaway:

When exchanging textual data, such as reading from or writing to a file, as a programmer you need to indicate to Python that it should handle the data as text. This requires a few additional steps before writing or after reading, which Python will take care of automatically, such as:

  • encoding, i.e.: converting from the str datatype to raw data and
  • decoding, i.e.: converting from raw binary data to the str type

In our earlier examples we did this using the "t" letter code in the open() function.

At this point we have covered the fundamental of reading and writing text files. The rest of the session we will spend on the most popular formats which are used to store data in text files.

The CSV format (see [CSVformat]) is a frequently used, application and platform independent format to exchange tabular data. A typical example:

1 name|full_name|group|gendergroup|agegroup
2 fred|Fred Flintstone|flintstones|m|adults
3 wilma|Wilma Flintstone|flintstones|f|adults
4 pebbles|Pebbles Flintstone|flintstones|f|kids

This data represents the following table:

Cartoon characters
name full_name group gendergroup agegroup
fred Fred Flintstone flintstones m adults
wilma Wilma Flintstone flintstones f adults
pebbles Pebbles Flintstone flintstones f kids

A few noteworthy points about the above example:

  • the first row contains the names of the columns
  • the delimiter is the '|' (vertical bar) character
  • the data consist of 3 rows and 5 columns
  • the strings are not quoted

In Python there are at least a handful of ways and modules to process CSV files. We will focus here on the most obvious one: the "Python Standard Library's" csv (see [pydoc_csv]) module.

Let's take the above example and create a CSV file from it.

 1 #!/usr/bin/env python3
 2 
 3 import csv
 4 
 5 # The CSV data
 6 names='''
 7 name|full_name|group|gendergroup|agegroup
 8 fred|Fred Flintstone|flintstones|m|adults
 9 wilma|Wilma Flintstone|flintstones|f|adults
10 pebbles|Pebbles Flintstone|flintstones|f|kids
11 '''
12 
13 # convert the ``names`` str to a list of lists
14 data = names.strip()           # remove white-space chars from both ends
15 data = data.split('\n')        # split ``str`` into lines, returns a ``list``
16 data = [ line.split('|') for line in data ]  # split all rows into its fields
17 
18 # the ``data`` variable now points to a list object, each of whose element
19 # is a list:
20 # data = [
21 #   ['name', 'full_name', 'group', 'gendergroup', 'agegroup'],
22 #   ['fred', 'Fred Flintstone', 'flintstones', 'm', 'adults'],
23 #   ['wilma', 'Wilma Flintstone', 'flintstones', 'f', 'adults'],
24 #   ['pebbles', 'Pebbles Flintstone', 'flintstones', 'f', 'kids']
25 # ]
26 
27 # Now let's write this out to the file ``names.csv``
28 with open('names.csv', 'wt') as fh:
29    csv_w = csv.writer(fh, dialect='excel', delimiter='|')
30    csv_w.writerows(data)

The csv module's "Dialects and Formatting Parameters" section (see [pydoc_csv_formatting]) provides more information about additional bells and whistles when exporting data to CSV, e.g.:

  • quoting: whether or not to quote strings
  • escapechar: how to escape characters in the data, which coincide with the delimiter character
  • etc ...

Execute this program by entering:

python3 write-names-as-csv.py

and verify the file it has produced:

cat names.csv

name|full_name|group|gendergroup|agegroup
fred|Fred Flintstone|flintstones|m|adults
wilma|Wilma Flintstone|flintstones|f|adults
pebbles|Pebbles Flintstone|flintstones|f|kids

Now that we have an example CSV example, we can re-create the Python data structure from the data:

1 #!/usr/bin/env python3
2 
3 import csv
4 with open('names.csv') as fh:
5     csv_r = csv.reader(fh, dialect='excel', delimiter='|')
6     data = list(csv_r)
7 print(data)

The steps:

  • line 3: load the csv module

  • line 4: the with statement is an improved way of using (amongst others) the open() function, which will automatically close the file handler if Python is done with the code block (lines 5 and 6)

    For detailed information on this construct see the [pep343] or search for the term "python context manager".

  • line 5: create a new CSV reader object with the specified details about the delimiter and CSV dialect

  • line 6: convert the data represented by the CSV reader to a list object

  • line 7: print out the data

The INI format (see [INI_format]) is capable of representing information organized in a tree structure, which lends itself well for its main use case: configuration files. Besides that the INI format can also be used for data exchange.

Similarly to the CSV format despite of lacking an official standard, it has been in use for decades and as a result has a multitude of (slightly inconsistent) implementations.

In terms of the format's details, the content is divided into sections, which in turn is a listing of properties and their associated values.

Python has an implementation in the "Python Standard Library" in the module configparser (see [pydoc_configparser]).

In Python terminology, while the CSV format is well-suited for storing list-like data, the INI format is a good choice for storing dict-like data.

In this section we will be working with the data represented by the following dict object:

names = {
          'kids': {
                    'Chris': 'Family Guy',
                    'Pebbles': 'The Flintstones',
                    'Bart': 'The Simpsons'
                  },
          'adults': {
                      'Fred': 'The Flintstones',
                      'Betty': 'The Flintstones',
                      'Homer': 'The Simpsons',
                      'Lois': 'Family Guy'
                    },
          'other': { 'Klaus': 'American Dad',
                     'Brian': 'Family Guy',
                     'Roger': 'American Dad'
                   }
        }

The following is one of the simplest solution to export to an INI file:

 1 #!/usr/bin/env python3
 2 
 3 import configparser
 4 
 5 names = {
 6           'kids': {
 7                     'Chris': 'Family Guy',
 8                     'Pebbles': 'The Flintstones',
 9                     'Bart': 'The Simpsons'
10                   },
11           'adults': {
12                       'Fred': 'The Flintstones',
13                       'Betty': 'The Flintstones',
14                       'Homer': 'The Simpsons',
15                       'Lois': 'Family Guy'
16                     },
17           'other': { 'Klaus': 'American Dad',
18                      'Brian': 'Family Guy',
19                      'Roger': 'American Dad'
20                    }
21         }
22 ini = configparser.ConfigParser()
23 ini.update(names)
24 with open('names.ini', 'wt') as fh:
25    ini.write(fh)

When executing this example, it creates the names.ini file with the following content:

[kids]
chris = Family Guy
pebbles = The Flintstones
bart = The Simpsons

[adults]
fred = The Flintstones
betty = The Flintstones
homer = The Simpsons
lois = Family Guy

[other]
klaus = American Dad
brian = Family Guy
roger = American Dad

Note the lower-case key names (e.g.: 'chris', 'pebbles' etc...). This is the default behavior of the ConfigParser class, since the original implementation of the configparser module tried to adhere the INI format used on Windows. Windows is case-insensitive, hence the class' default behavior.

With the following slight modification we can preserve the upper-case letters:

 1 #!/usr/bin/env python3
 2 
 3 import configparser
 4 
 5 names = {
 6           'kids': {
 7                     'Chris': 'Family Guy',
 8                     'Pebbles': 'The Flintstones',
 9                     'Bart': 'The Simpsons'
10                   },
11           'adults': {
12                       'Fred': 'The Flintstones',
13                       'Betty': 'The Flintstones',
14                       'Homer': 'The Simpsons',
15                       'Lois': 'Family Guy'
16                     },
17           'other': { 'Klaus': 'American Dad',
18                      'Brian': 'Family Guy',
19                      'Roger': 'American Dad'
20                    }
21         }
22 ini = configparser.ConfigParser()
23 ini.optionxform = str               # make sure to preserve case!
24 ini.update(names)
25 with open('names-case-preserved.ini', 'wt') as fh:
26    ini.write(fh)

A few details of this improved version:

  • line 3: load the configparser module
  • line 22: create a new ConfigParser object
  • line 23: make sure to preserve upper- and lower-cases in both section- and key names!
  • line 24: copy the data from the names dictionary object
  • line 25: open the output file (as a reminder: see [pep343] for more information on using context managers)
  • line 26: write the data to the output file

As usual, we'll try to read in the data from the file we just created.

1 #!/usr/bin/env pythone
2 
3 import configparser
4 ini = configparser.ConfigParser()
5 ini.optionxform = str               # make sure to preserve case!
6 files_read = ini.read(['names-case-preserved.ini'])
7 names = { section:dict(ini[section]) for section in ini.sections() }
8 print(names)

So let's unpack what has happened here:

  • lines 3, 4 and 5: load the configparser module, create a new ConfigParser object and make sure it preserves upper- and lower-case letters ; same as in the previous example

  • line 5: the .read() method is an interesting one... it is capable of reading, parsing and merging multiple INI files in one go.

    As its argument we provide a collection (in this case a list) of strings, which will be interpreted by the method as file names. The .read() method will try to read and parse them.

    The names of all successfully processed files will be provided as the elements of the list object it returns.

    Very convenient!

  • line 6: this is where we convert the ConfigParser object to a dict. This is not required, since we can access the data in the ini object as well. However for an easy comparison with what we've started it is convenient to see the data as a dict

    The conversion is done using a "dictionary comprehension" (see [pep274]), which is a convenient shorthand for a full-blown for loop.

    To unpack its working we could write the instruction up in a way which better indicates the details:

    names = {
     section                          # key of the new element is the section name
     :                                # required syntax
     dict(ini[section])               # value is the converted ``Section``
                                      # object to a ``dict``
     for section in ini.sections()    # loop through each section name
    }
  • line 7: display the data

When we execute this program we see the following:

python3 read-names-from-ini.py

{'kids': {'Chris': 'Family Guy', 'Pebbles': 'The Flintstones', 'Bart': 'The
Simpsons'}, 'adults': {'Fred': 'The Flintstones', 'Betty': 'The
Flintstones', 'Homer': 'The Simpsons', 'Lois': 'Family Guy'}, 'other':
{'Klaus': 'American Dad', 'Brian': 'Family Guy', 'Roger': 'American Dad'}}

Using the INI format for configuration data is not significantly different and most of the differences arise from conventions after decades of use.

Create a new configuration file based on the example at https://docs.python.org/3/library/configparser.html#quick-start

 1 #!/usr/bin/env python3
 2 
 3 import configparser
 4 
 5 cfg = configparser.ConfigParser()
 6 cfg.optionxform = str               # make sure to preserve case!
 7 
 8 # add the DEFAULT section
 9 cfg['DEFAULT'] = {'ServerAliveInterval': 45,
10                   'Compression': 'yes',
11                   'CompressionLevel': 9,
12                   'ForwardX11': 'yes'}
13 
14 # add a new section
15 cfg['bitbucket.org'] = {}
16 cfg['bitbucket.org']['User'] = 'hg'
17 
18 # another new section
19 cfg['topsecret.server.com'] = {}
20 topsecret = cfg['topsecret.server.com']
21 topsecret['Port'] = '50022'
22 topsecret['ForwardX11'] = 'no'
23 
24 with open('servers.ini', 'wt') as fh:
25    cfg.write(fh)

This creates the following INI file:

[DEFAULT]
ServerAliveInterval = 45
Compression = yes
CompressionLevel = 9
ForwardX11 = yes

[bitbucket.org]
User = hg

[topsecret.server.com]
Port = 50022
ForwardX11 = no

Suppose that for the reason of separating out concerns, we have decided to split up our configuration information into the following 2 files:

  • servers.ini from the earlier example, containing generic server related configuration, and
  • user.ini containing the specific preferences of a user as follows:
[DEFAULT]
ServerAliveInterval = 200
ForwardX11 = no

[www.example.com]
User = jdoe

A slightly modified version of our earlier *INI* reader example will read and merge both the servers.ini

 1 #!/usr/bin/env pythone
 2 
 3 import configparser
 4 ini = configparser.ConfigParser()
 5 ini.optionxform = str               # make sure to preserve case!
 6 files_read = ini.read(['servers.ini', 'user.ini'])
 7 cfg = { section:dict(ini[section])
 8         for section in ini.sections() + ['DEFAULT'] }
 9 print(cfg)

This program produces the following output (slightly re-formatted for readability):

python3 -i read-multiple-ini.py

{
  'bitbucket.org': {
     'User': 'hg', 'ServerAliveInterval': '200', 'Compression': 'yes',
     'CompressionLevel': '9', 'ForwardX11': 'no'},
  'topsecret.server.com': {
     'Port': '50022', 'ForwardX11': 'no', 'ServerAliveInterval': '200',
     'Compression': 'yes', 'CompressionLevel': '9'},
  'www.example.com': {
     'User': 'jdoe', 'ServerAliveInterval': '200', 'Compression': 'yes',
     'CompressionLevel': '9', 'ForwardX11': 'no'},
  'DEFAULT': {
     'ServerAliveInterval': '200', 'Compression': 'yes',
     'CompressionLevel': '9', 'ForwardX11': 'no'}
}

Note that some of the entries defined in servers.ini are overwritten by the matching entries in user.ini and there is also a new section:

  • changes in the [DEFAULT] section:
    • value change of ServerAliveInterval: 45 -> 200
    • value change of ForwardX11: yes -> no
  • new section www.example.com

For more information see the following section of the configparser module's documentation:

https://docs.python.org/3/library/configparser.html#interpolation-of-values

The JSON file format (see [JSON]) is an ECMA open standard well-suited for exchanging tree-like data in a human-readable text format. JSON is widely used for storing both configuration information and actual data.

The "Python Standard Library" provides out-of-the-box JSON support in the json modules.

Let's take the cartoon characters data from the names dictionary of our earlier example and export it to a JSON file:

 1 #!/usr/bin/env python3
 2 
 3 import json
 4 
 5 names = {
 6           'kids': {
 7                     'Chris': 'Family Guy',
 8                     'Pebbles': 'The Flintstones',
 9                     'Bart': 'The Simpsons'
10                   },
11           'adults': {
12                       'Fred': 'The Flintstones',
13                       'Betty': 'The Flintstones',
14                       'Homer': 'The Simpsons',
15                       'Lois': 'Family Guy'
16                     },
17           'other': { 'Klaus': 'American Dad',
18                      'Brian': 'Family Guy',
19                      'Roger': 'American Dad'
20                    }
21         }
22 with open('names.json', 'wt') as fh:
23    json.dump(names, fh)

Focusing on the new instructions:

  • line 3: load the json module
  • line 23: the json.dump() takes 2 arguments: the data structure (usually as a dict) and a file-handler

The program produces the "names.json" file with the following content (re-formatted for readability):

{
  "kids": {
    "Chris": "Family Guy",
    "Pebbles": "The Flintstones",
    "Bart": "The Simpsons"
  },
  "adults": {
    "Fred": "The Flintstones",
    "Betty": "The Flintstones",
    "Homer": "The Simpsons",
    "Lois": "Family Guy"
  },
  "other": {
    "Klaus": "American Dad",
    "Brian": "Family Guy",
    "Roger": "American Dad"
  }
}

Note the striking similarities in the syntax of the JSON format and the actual Python syntax of dictionaries! Almost exactly the same.

Loading a JSON file is fairly trivial with Python's json module:

1 #!/usr/bin/env pythone
2 
3 import json
4 with open('names.json') as fh:
5    names = json.load(fh)
6    print(names)

Output (re-formatted):

python3 read-names-from-json.py

{
  'kids': {
    'Chris': 'Family Guy', 'Pebbles': 'The Flintstones',
    'Bart': 'The Simpsons'
  },
  'adults': {
    'Fred': 'The Flintstones', 'Betty': 'The Flintstones',
    'Homer': 'The Simpsons', 'Lois': 'Family Guy'
  },
  'other': {
    'Klaus': 'American Dad', 'Brian': 'Family Guy', 'Roger': 'American Dad'
  }
}

In terms of purpose the [YAML] format is quite similar to JSON, except for two aspects:

  • its improved readability
  • provides richer data serialization capabilities

One of the most notable usage for of the YAML format in the Python ecosystem is the [Ansible] configuration management solution.

Currently the "Python Standard Library" does not have YAML support. Fortunately there are multiple 3rd party modules, which can be easy installed using the pip package management tool.

Perhaps the most popular solution is provided by the [PyYAML] project. To install the module execute:

pip install pyyaml

Re-using the cartoon characters data we can export the names dictionary with the following simple code:

 1 #!/usr/bin/env python3
 2 
 3 import yaml
 4 
 5 names = {
 6           'kids': {
 7                     'Chris': 'Family Guy',
 8                     'Pebbles': 'The Flintstones',
 9                     'Bart': 'The Simpsons'
10                   },
11           'adults': {
12                       'Fred': 'The Flintstones',
13                       'Betty': 'The Flintstones',
14                       'Homer': 'The Simpsons',
15                       'Lois': 'Family Guy'
16                     },
17           'other': { 'Klaus': 'American Dad',
18                      'Brian': 'Family Guy',
19                      'Roger': 'American Dad'
20                    }
21         }
22 with open('names.yaml', 'wt') as fh:
23    yaml.dump(names, fh)

Focusing on the new instructions:

  • line 3: load the yaml module
  • line 23: the yaml.dump() takes 2 arguments: the data structure (usually as a dict) and a file-handler

The program produces the "names.yaml" file with the following content (re-formatted for readability):

adults: {Betty: The Flintstones, Fred: The Flintstones, Homer: The Simpsons, Lois: Family Guy}
kids: {Bart: The Simpsons, Chris: Family Guy, Pebbles: The Flintstones}
other: {Brian: Family Guy, Klaus: American Dad, Roger: American Dad}

The YAML specifications (see [YAML_specs]) contain various simple, yet informative examples about the more advanced capabilities of the format.

---
invoice: 34843
date   : 2001-01-23
bill-to: &id001
    given  : Chris
    family : Dumars
    address:
        lines: |
            458 Walkman Dr.
            Suite #292
        city    : Royal Oak
        state   : MI
        postal  : 48046
ship-to: *id001
product:
    - sku         : BL394D
      quantity    : 4
      description : Basketball
      price       : 450.00
    - sku         : BL4438H
      quantity    : 1
      description : Super Hoop
      price       : 2392.00
tax  : 251.42
total: 4443.52
comments:
    Late afternoon is best.
    Backup contact is Nancy
    Billsmer @ 338-4338.

Let's create our usual program to read the current format, but with a few additional features:

  1. Instead of hard-coding the data file's name in the program, require the data file's name as an CLI argument
  2. Using the pprint (pretty-print) module display the data in a more readable format
  3. Try to handle errors
 1 #!/usr/bin/env pythone
 2 
 3 import sys
 4 import yaml
 5 import pprint as pp
 6 
 7 try:
 8    fh = open(sys.argv[1])
 9    data = yaml.load(fh)
10 except IndexError:
11    print('I need an argument: YAML file name')
12    sys.exit(1)
13 except FileNotFoundError:
14    print('File "{}" is not found!'.format(sys.argv[1]))
15    sys.exit(2)
16 except yaml.parser.ParserError:
17    msg = 'The file {} does not appear to be a valid YAML file!'
18    print(msg.format(sys.argv[1]))
19 pp.pprint(data)

The eXtensible Markup Language (XML) format is has been the workhorse of data exchange in the last two decades. Because of its flexibility and the rich ecosystem XML is still an often used format, especially in enterprise applications. The format is best suited for the representation of tree-like data structures. XML also provides standards and tools for the verification of the data based on schema.

Following is a first simple example of our cartoon characters data in as XML:

<character id="1000"fname="Fred" sname="Flintstone">
  <appeared_in>The Flintstones</appeared_in>
  <relations>
    <relation character_id="1001" type="child">
    <relation character_id="1002" type="spouse">
  </relations>
</character>

Python has rich XML support in the form of multiple XML related modules and projects. The two main reason for the lack of a single implementation are on the one hand the huge scope and functionality of XML, and different implementations with overlapping functionality. Out of the many a few of these XML implementations also have made it to the "Python Standard Library". (see for more [pydoc_markup_tools])

In our example we'll be using the lxml module, which is currently a popular choice. (see for more [lxml]) This module is a wrapper around the libxml2 and libxslt libraries, both written in C. (see also: http://www.xmlsoft.org/)

Execute the following to install the lxml module:

pip install lxml

NOTE: The installation of lxml require a working C compiler, which may be an issue, especially for Windows users. Please visit the "Installing lxml" page ([lxml_install]) which provides a link to the "unofficial Windows binaries".

Working with XML is significantly more complex than the previously discussed formats. To get acquainted with the API let's follow the "lxml Tutorial" (see [lxml_tutorial]) and try to code the above example manually in an interactive Python session:

1 >>> from lxml import etree
2 >>> root = etree.Element('character', attrib={'id': '1000', 'fname': 'Fred', 'sname': 'Flintstone'})

At this point the root variable is bound to an object, which represents an XML node. Let's see how the data is organized:

3 >>> type(root)                      # the object's type
4 <type 'lxml.etree._Element'>
5 >>> root.tag                        # the XML tag of this node
6 'character'
7 >>> root.attrib                     # the XML attributes of the node
8 {'fname': 'Fred', 'id': '1000', 'sname': 'Flintstone'}

The string representation of this node:

 9 >>> etree.tostring(root)
10 b'<character fname="Fred" id="1000" sname="Flintstone"/>'

WARNING: Notice the b prefix in the result of the tostring() function: the returned object is of type bytes (and not str)! If we were to attempt to write this to a file which has been opened in text mode (e.g.: "open( 'somename.xml', 'wt')"), we would get an Exception!

The simple rule here is: bytes objects can be written to binary files (i.e.: mode "wb") and str objects to text files (i.e.: mode "wt")

For a str result we need to decode the tostring() function's result:

11 >>> etree.tostring(root).decode()
12 '<character fname="Fred" id="1000" sname="Flintstone"/>'

Next, we add two child nodes representing the appeared_in and the relations tags and check the XML output:

13 >>> a = etree.Element('appeared_in') # create a new node
14 >>> b = etree.Element('relations')   # another new node
15 >>> root.append(a)                   # add the ``a`` node to root
16 >>> root.append(b)                   # add the ``b`` node to root
17 
18 >>> ### Check the XML representation of ``root``
19 >>> print(etree.tostring(root, pretty_print=True).decode())
20 <character fname="Fred" id="1000" sname="Flintstone">
21   <appeared_in/>
22   <relations/>
23 </character>
24 
25 >>>

The root element is an container object, which currently holds 2 objects. The API makes it possible to treat root as if it was a list.

26 >>> root[0].tag                     # element lookup with similar to ``list``
27 'appeared_in'
28 >>> len(root)                       # the ``len()`` works correctly
29 2
30 
31 ### with ``for`` we can iterate through the elements
32 >>> for node in root:
33 ...     print(node.tag, node.attrib)
34 ...
35 appeared_in {}
36 relations {}
37 >>>

Let's add the missing pieces:

38 ### Add the cartoon title to the ``appeared_in`` tag
39 >>> root[0].text = 'The Flintstones'
40 >>> print(etree.tostring(root, pretty_print=True).decode())
41 ### Show changes in XML representation
42 <character fname="Fred" id="1000" sname="Flintstone">
43   <appeared_in>The Flintstones</appeared_in>
44   <relations/>
45 </character>
46 
47 ### Add first relation
48 >>> r1 = etree.Element('relation', attrib={'character_id': '1002', 'type': 'spouse'})
49 ### index 1 is the earlier ``relations`` node. Add ``r1`` node to this
50 >>> root[1].append(r1)
51 ### Show XML representation
52 >>> print(etree.tostring(root, pretty_print=True).decode())
53 <character fname="Fred" id="1000" sname="Flintstone">
54   <appeared_in>The Flintstones</appeared_in>
55   <relations>
56     <relation character_id="1002" type="spouse"/>
57   </relations>
58 </character>
59 
60 ### Add second relation
61 >>> r2 = etree.Element('relation', attrib={'character_id': '1001', 'type': 'child'})
62 >>> root[1].append(r2)
63 ### Show XML representation
64 >>> print(etree.tostring(root, pretty_print=True).decode())
65 <character fname="Fred" id="1000" sname="Flintstone">
66   <appeared_in>The Flintstones</appeared_in>
67   <relations>
68     <relation character_id="1002" type="spouse"/>
69     <relation character_id="1001" type="child"/>
70   </relations>
71 </character>

At this point we can export the data in XML format:

72 >>> fh = open('names-fred-written.xml', 'wt')
73 >>> fh.write(etree.tostring(root, pretty_print=True).decode())
74 198
75 >>> fh.close()

The content of the names-fred-written.xml file looks like this:

<character fname="Fred" id="1000" sname="Flintstone">
  <appeared_in>The Flintstones</appeared_in>
  <relations>
    <relation character_id="1002" type="spouse"/>
    <relation character_id="1001" type="child"/>
  </relations>
</character>

To summarize the above interactive session, let's put the essential bits into a single Python program:

 1 #!/usr/bin/env python3
 2 from lxml import etree
 3 root = etree.Element('character',
 4                      attrib={'id': '1000',
 5                              'fname': 'Fred',
 6                              'sname': 'Flintstone'})
 7 a = etree.Element('appeared_in') # create a new node
 8 b = etree.Element('relations')   # another new node
 9 root.append(a)                   # add the ``a`` node to root
10 root.append(b)                   # add the ``b`` node to root
11 root[0].text = 'The Flintstones' # add title in node ``appeared_in``
12 r1 = etree.Element('relation', attrib={'character_id': '1002',
13                                        'type': 'spouse'})
14 root[1].append(r1)
15 r2 = etree.Element('relation', attrib={'character_id': '1001',
16                                        'type': 'child'})
17 root[1].append(r2)
18 
19 with open('names-fred-written.xml', 'wt') as fh:
20     fh.write(etree.tostring(root, pretty_print=True).decode())

Using our previous example, let's try re-create the root object from the XML file:

1 >>> fh = open('names-fred-written.xml')
2 >>> root = etree.fromstring(fh.read())
  • line 1: open the XML file in read-only and text mode, because of implied "wt" mode string
  • line 2: the step in the order as they are executed:
    • fh.read() method will read the entire content of the file into a str
    • etree.fromstring() function will convert this string to an instance of lxml.etree._Element, i.e.: the Python object representing the XML content

At this point the object bound to the root variable contains all information from the file. Using a for loop we can iterate through the elements:

3 >>> for node in root:
4 ...     tmpl.format(node.tag, len(node), node.text, node.attrib)
5 ...
6 'appeared_in          0     The Flintstones      {}'
7 'relations            2     \n                    {}'

To summarize:

1 #!/usr/bin/env python3
2 from lxml import etree
3 
4 fh = open('names-fred-written.xml')
5 root = etree.fromstring(fh.read())
6 for node in root:
7     tmpl.format(node.tag, len(node), node.text, node.attrib)

For large XML files this approach may be less efficient, since all the processing is done in the Python interpreter. A more efficient way would be to outsource the bulk of the work the underlying libxml2 and libxstl libraries (see http://www.xmlsoft.org/), which we will consider next.

Because the data that can be stored in XML format, can be large in terms of both complexity and amount, processing is usually a non-trivial task. The lxml FAQ (see [lxml_faq]) lists several considerations when processing XML data. Dealing with large XML files can be more efficient using the [xpath] technology, especially if we are only interested in certain parts.

Using an XPath search string return all nodes with the relation tag. The parsing and searching is done in the C library, instead of Python, so it's the performance is optimal:

1 >>> root.findall(".//relation")
2 [<Element relation at 0x7f923f495888>, <Element relation at 0x7f923f46b7c8>]

As above, except loop through all the relation node(s) and show their attributes.

1 >>> for e in root.findall(".//relation"): print(e.attrib)
2 ...
3 {'character_id': '1002', 'type': 'spouse'}
4 {'character_id': '1001', 'type': 'child'}

The following XPath search string will limit the results to relation nodes which have a type attribute with the value spouse.

1 >>> for e in root.findall('.//relation[@type="spouse"]'):
2 ...     print(e.attrib)
3 ...
4 {'character_id': '1002', 'type': 'spouse'}
5 >>>
[pydoc_open]Documentation of the open() function https://docs.python.org/3/library/functions.html#open
[ASCII1967]ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters. See: https://en.wikipedia.org/wiki/ASCII
[Unicode2001](1, 2) Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. https://en.wikipedia.org/wiki/Unicode
[CSVformat]A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas (',') or other delimiter characters, such as semicolon (';'), colon (':'), bar ('|'), etc... https://en.wikipedia.org/wiki/Comma-separated_values
[pydoc_csv]Python Standard Library documentation, CSV module https://docs.python.org/3/library/csv.html
[pydoc_csv_formatting]csv module's Dialects and Formatting Parameters https://docs.python.org/3/library/csv.html#csv-fmt-params
[pep343](1, 2) PEP 343 -- The "with" Statement https://www.python.org/dev/peps/pep-0343/
[INI_format]The INI file format is an informal standard for configuration files capable of representing tree-structure like information. See: https://en.wikipedia.org/wiki/INI_file
[pydoc_configparser]The configparser module implements the INI configuration file format. See https://docs.python.org/3/library/configparser.html
[pep274]Dictionary Comprehensions https://www.python.org/dev/peps/pep-0274/
[JSON]JSON is a language-independent data format. It was derived from JavaScript, but many modern programming languages include code to generate and parse JSON-format data. https://en.wikipedia.org/wiki/JSON
[YAML]YAML is a human-readable data-serialization language. It is commonly used for configuration files, but could be used in many applications where data is being stored or transmitted https://en.wikipedia.org/wiki/YAML
[Ansible]Ansible is an open-source software provisioning, configuration management, and application-deployment tool. https://en.wikipedia.org/wiki/Ansible_(software)
[PyYAML]PyYAML is a full-featured YAML framework for the Python programming language. https://pyyaml.org/
[YAML_specs]Full length examples of YAML https://yaml.org/spec/1.2/spec.html#id2761803
[lxml_install]https://lxml.de/installation.html
[lxml_tutorial]A brief overview of lxml's concepts https://lxml.de/tutorial.html
[pydoc_markup_tools]The Python Standard Library's Structured Markup Processing Tools https://docs.python.org/3/library/markup.html
[lxml]The lxml XML toolkit is a Pythonic binding for the C libraries libxml2 and libxslt. It provides a the feature completeness of the C libraries with a mostly compatible API to the well-known ElementTree API from the Python Standard Library.
[lxml_faq]The lxml Frequently Asked Questions https://lxml.de/FAQ.html
[xpath]

The XPath language provides the ability to navigate around in XML documents, selecting nodes by a variety of criteria.

For more see: