Creating a new Database Export to DaRUS and the Public Damast Instance
These are instructions for creating a new data export from the live database. This export is then published on DaRUS, the data repository of the University of Stuttgart. It also serves as the data basis for the public version of Damast at the HU Berlin.
The data export does not contain all data in the database. The contents of some tables are excluded entirely, and additional criteria determine which of the remaining data is included. Because of the complex nature of the data schema, the export works as follows:
- First, all data is exported from the database via a full dump.
- Then, the data that should not be included is removed.
- Further, some data is altered; for example, the Shi'ite subgroups are removed, and all their evidences are filed under the general Shi'ite group instead.
Note: The instructions here contain shell commands that need to be run.
Some of these are run on your own computer, some on the host of the Damast system that will show the DaRUS data.
To differentiate, the command line prompt on your own host does not show a user and host name.
The Damast host will be called damast, and the user ubuntu (a probable candidate for a VPS or IaaS-hosted Damast host).
The command line prompts then look as follows:
# local prompt:
$
# local prompt with a Python virtualenv:
(env) $
# Damast host prompt
ubuntu@damast $
The full procedure can be run automatically using a script file, which is included below. The script makes some assumptions and needs to be modified for the correct host name and passwords:
- The host running the live PostgreSQL database to dump from must be set in line 12 of the export script. For the purpose of this script, it is assumed that the PostgreSQL server listens on port 5432 on that host. SSH is used to create a local port forwarding to connect to the database.
- The password for the postgres user in that database must be set in line 13.
- Docker must be installed on the system where the script is run. The user running the script also needs to be in the /etc/sudoers file.
- The Python script that creates the final JSON file requires a Python virtualenv with the postgres package installed to be available under env/ in the script directory. That can be ensured as follows:
  $ python -m venv env
  $ source env/bin/activate
  (env) $ pip install postgres
- The four files described in the following subsections need to be present, the last of them being the script itself.
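The preconditions above can be checked quickly before starting. The following commands are only a sketch; they assume the file names used in this documentation and a POSIX shell:
$ ls cleanup.sql to-json.py create_dump.sh init.d/init-user-db.sh
$ command -v docker pg_dump psql pg_isready ssh zip
$ sudo -v                              # succeeds only if the user may run sudo commands
(env) $ python -c 'import postgres'    # succeeds only inside the prepared virtualenv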
Once all files are present and all preconditions are met, the database export can be initiated as follows:
$ ./create_dump.sh
This will probably ask for the user's password to start sudo commands, and for the password or private key passphrase for connecting via SSH to the database host.
The script caches the initial database dump locally in prod.sql.
To ensure that subsequent exports actually use the current database contents, be sure to remove that file after a successful export.
The following SQL file should be called cleanup.sql so that the export script can find it.
-- This assumes a clean database export
-- remove user log stuff
DELETE FROM user_action;
DELETE FROM users;
-- drop annotation suggestion stuff
DELETE FROM annotation_suggestion;
DELETE FROM annotation_suggestion_document_state;
-- delete all persons and related data
UPDATE evidence SET person_instance_id = NULL;
DELETE FROM person_instance;
DELETE FROM person;
-- remove all evidence without the tag 'DhiMu'
WITH tag_ids AS (
SELECT id
FROM tag
WHERE tagname = 'DhiMu'
),
evidence_ids AS (
SELECT evidence_id AS id
FROM tag_evidence
WHERE tag_id IN (SELECT id FROM tag_ids)
)
DELETE FROM evidence
WHERE id NOT IN (SELECT id FROM evidence_ids);
-- remove tags 'DhiMu', 'eOC', 'Annotator beta test', 'Non-residential', 'Community'
DELETE
FROM tag
WHERE tagname IN ('DhiMu', 'Annotator beta test', 'eOC', 'Non-residential', 'Community');
-- delete evidence not from source list
WITH source_ids AS (
SELECT id
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
)
),
evidence_ids AS (
SELECT DISTINCT evidence_id
FROM source_instance
WHERE source_id IN (SELECT id FROM source_ids)
)
DELETE
FROM evidence E
WHERE E.id NOT IN (SELECT evidence_id FROM evidence_ids);
-- clean up orphaned instances and annotations
-- This needs to happen before removing the sources to clean up orphaned annotations -> document -> source
WITH E AS (
SELECT time_group_id AS tgid
FROM evidence
WHERE time_group_id IS NOT NULL
)
DELETE FROM time_group TG
WHERE TG.id NOT IN (SELECT tgid FROM E);
WITH E AS (
SELECT place_instance_id AS piid
FROM evidence
WHERE place_instance_id IS NOT NULL
)
DELETE FROM place_instance PI
WHERE PI.id NOT IN (SELECT piid FROM E);
WITH E AS (
SELECT person_instance_id AS piid
FROM evidence
WHERE person_instance_id IS NOT NULL
)
DELETE FROM person_instance PI
WHERE PI.id NOT IN (SELECT piid FROM E);
WITH E AS (
SELECT religion_instance_id AS riid
FROM evidence
WHERE religion_instance_id IS NOT NULL
)
DELETE FROM religion_instance RI
WHERE RI.id NOT IN (SELECT riid FROM E);
WITH P AS (
SELECT DISTINCT annotation_id AS aid
FROM place_instance
WHERE annotation_id IS NOT NULL
),
PE AS (
SELECT DISTINCT annotation_id AS aid
FROM person_instance
WHERE annotation_id IS NOT NULL
),
R AS (
SELECT DISTINCT annotation_id AS aid
FROM religion_instance
WHERE annotation_id IS NOT NULL
),
T AS (
SELECT DISTINCT annotation_id AS aid
FROM time_group
WHERE annotation_id IS NOT NULL
)
DELETE FROM annotation A
WHERE A.id NOT IN (SELECT aid FROM P)
AND A.id NOT IN (SELECT aid FROM PE)
AND A.id NOT IN (SELECT aid FROM R)
AND A.id NOT IN (SELECT aid FROM T);
-- delete documents not from source list
WITH source_ids AS (
SELECT id
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
)
)
DELETE
FROM document
WHERE source_id NOT IN (SELECT id FROM source_ids);
-- delete sources not in source list (source_instance tuples are already cascade-deleted)
DELETE
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id NOT IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
);
-- delete places not used in evidences
WITH place_ids AS (
SELECT place_id AS id
FROM place_instance
)
DELETE
FROM place
WHERE id NOT IN (SELECT id FROM place_ids);
-- update religions: merge all Shiites under the main Shia religion, update
-- comments of religion instances to reflect where the entries came from, drop
-- sub-Shiite religions
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Zayidiya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM five);
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Isma''iliya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM seven);
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Imamiya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM twelve);
WITH
rids AS ( SELECT id FROM religion WHERE abbreviation IN ('5S', '7S', '12S' ))
DELETE FROM religion
WHERE id IN ( SELECT id FROM rids );
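To inspect the effect of the cleanup script manually, the queries below can be run against the local Docker database before it is stopped, for example by running the export steps by hand or by temporarily removing the final docker stop from the export script. The queries are only a sketch; the table names are those used in the cleanup script above:
$ PGPASSWORD=postgres psql -h localhost -U postgres -d ocn <<'EOSQL'
-- remaining evidence tuples (only whitelisted sources should be left)
SELECT count(*) AS evidences FROM evidence;
-- persons and user log entries should be gone
SELECT count(*) AS persons FROM person;
SELECT count(*) AS user_actions FROM user_action;
-- the Shi'ite subgroups should no longer exist
SELECT count(*) AS sub_shia FROM religion WHERE abbreviation IN ('5S', '7S', '12S');
EOSQL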
The following Python script should be called to-json.py so that the export script can find it.
#!/usr/bin/env python3
import json
import os
from postgres import Postgres
from postgres.cursors import SimpleDictCursor
from psycopg2.extras import NumericRange
jsonfile = 'database.json'
ignore_tables = [
'action_type',
'annotation_suggestion_document_state',
'place_time_range',
'bishopric_type',
'bishopric_place',
'bishopric',
'place_set',
'user_action',
'advanced_religion_filter',
'religion_filter_group_set',
'religion_filter_group',
'spatial_ref_sys',
'bishopric_residence',
'person_bishopric',
'source_bishopric',
'annotation_suggestion',
'external_database',
]
class _Encoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, NumericRange):
            if o.isempty:
                return None
            lower = None if o.lower_inf else o.lower if o.lower_inc else o.lower + 1
            upper = None if o.upper_inf else o.upper if o.upper_inc else o.upper - 1
            return dict(type='int4range', lower=lower, upper=upper)
        if isinstance(o, memoryview):
            return bytes(o).decode('utf-8')
        return json.JSONEncoder.default(self, o)

conn = Postgres('user=postgres password=postgres dbname=ocn host=localhost', cursor_factory=SimpleDictCursor)

with conn.get_cursor() as cur:
    cur.run('''SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = %(schema)s
          AND table_type = %(type)s
          AND NOT (table_name = ANY(%(ignore)s));''',
        schema = 'public',
        type = 'BASE TABLE',
        ignore = ignore_tables)

    data = dict()
    tables = list(map(lambda t: t['table_name'], cur.fetchall()))
    for table in tables:
        vs = cur.all(F'SELECT * FROM {table} ORDER BY id;')
        data[table] = vs

    with open(jsonfile, 'w') as f:
        json.dump(data, f, indent=2, cls=_Encoder)
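Once database.json has been written, the list of exported tables and their row counts can be inspected to confirm the export looks plausible. This is only a sketch that reads the JSON file produced above:
$ python3 - <<'EOF'
import json

# load the JSON dump created by to-json.py
with open('database.json') as f:
    data = json.load(f)

# print each exported table with its number of rows
for table in sorted(data):
    print(f'{table}: {len(data[table])} rows')
EOF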
A directory init.d/ should exist in the directory the script is executed in.
In it, place a script file, named for example init-user-db.sh, with the following contents:
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE USER users;
CREATE USER api;
CREATE USER ro_dump;
EOSQL
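These roles are presumably referenced by ownership and GRANT statements in the production dump, so creating them up front avoids errors while the dump is loaded. Whether the init script ran can be verified while the container is up (a sketch; the container listens on localhost because of --net=host):
$ PGPASSWORD=postgres psql -h localhost -U postgres -c '\du'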
Finally, the export script itself. It can have any name, but is called create_dump.sh in this documentation.
#!/bin/bash
set -euo pipefail
proddump=prod.sql
dockername=local-damast-db
finaldump=database.sql
schemadump=schema.sql
datazip=data.zip
# XXX: set the correct host and postgres-password to dump from
databasehost=dummy.host-name.org
databasepassword="foo bar"
log() {
printf "\n\033[1;34m%s\033[0m\n" "$1"
}
# dump production DB if no dump present
if [ ! -f $proddump ]
then
set +e
ssh -NfL 15432:localhost:5432 $databasehost
set -e
log "Dumping production database to $proddump ..."
PGPASSWORD="$databasepassword" \
pg_dump \
-U postgres \
-d ocn \
-h localhost \
-p 15432 \
--clean \
--create \
--if-exists \
--no-password \
| sed "s/LOCALE = 'en_GB.UTF-8'/LC_COLLATE = 'en_US.UTF-8'/" \
| sed "s/LC_CTYPE = 'en_GB.UTF-8'/LC_CTYPE = 'en_US.UTF-8'/" \
| grep -v "REVOKE .* \"27946\"" \
> $proddump
ls -l $proddump
fi
# set up database
sudo docker stop $dockername 2>/dev/null || true
log "Starting PostGIS database container ..."
sudo docker run --rm \
--detach \
--name "$dockername" \
--net=host \
-e POSTGRES_PASSWORD=postgres \
-v $(pwd)/init.d/:/docker-entrypoint-initdb.d:z \
postgis/postgis:10-3.1
# wait for the local server to start
while ! pg_isready -h localhost; do sleep 1; done
log "Loading database dump into PostGIS database ..."
PGPASSWORD=postgres psql \
-h localhost \
-U postgres \
--no-password \
-f $proddump
log "Dropping unneeded data ..."
PGPASSWORD=postgres psql \
-h localhost \
-U postgres \
-d ocn \
--no-password \
-f cleanup.sql
log "Creating final dump ..."
PGPASSWORD=postgres pg_dump \
-U postgres \
-d ocn \
-h localhost \
--clean \
--create \
--if-exists \
--no-password \
> $finaldump
log "Creating schema SQL file ..."
PGPASSWORD=postgres pg_dump \
-U postgres \
-d ocn \
-h localhost \
--create \
--schema-only \
> $schemadump
log "Creating JSON dump ..."
(
source env/bin/activate
python to-json.py
)
log "Terminating PostGIS database container ..."
sudo docker stop $dockername
log "Creating data ZIP file ..."
rm -f $datazip
zip -r $datazip database.json schema.sql database.sql
The export script should now have produced three files of relevance:
- database.sql: the exported data as an SQL file.
- schema.sql: the database schema without the data.
- database.json: the exported data in JSON format.
The three files are also contained in the ZIP file data.zip.
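Before uploading, it is worth checking that the three files are non-empty and that the ZIP archive contains exactly these files:
$ ls -lh database.sql schema.sql database.json
$ unzip -l data.zip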
To upload the new version to DaRUS, simply replace the three files of the same name in the DaRUS dataset after clicking Edit dataset.
Afterwards, edit the metadata appropriately, update the links to the files in the description (the fileIds should have changed), and then submit the new version for review.
Connect to the Damast instance where the DaRUS data should be visible.
- As a precaution, make a backup of important data (configuration, report database, and user database); see the sketch after this list.
- Copy the database.sql file created by the export script onto the Damast host system. For our purposes, the file is then located on the host system under /tmp/database.sql.
- Then, stop the Damast server:
ubuntu@damast $ sudo systemctl stop damast
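The backup mentioned in the first step can be as simple as copying the relevant files into a dated directory. The paths below are placeholders; use the actual locations of the configuration, report database, and user database on your host:
# all paths are placeholders and depend on the deployment
ubuntu@damast $ mkdir -p ~/damast-backup-$(date +%F)
ubuntu@damast $ cp /path/to/damast.conf /path/to/reports.db /path/to/users.db ~/damast-backup-$(date +%F)/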
The database.sql file should contain all instructions needed to replace the current database contents.
To apply it, run the following command on the host system:
ubuntu@damast $ psql -h localhost -U postgres -d postgres -f /tmp/database.sql
This command should ask for the database password of the postgres user and then start applying the database contents.
This will take a few minutes.
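A quick plausibility check after the import is to count a few central tables (a sketch; the actual numbers depend on the export):
ubuntu@damast $ psql -h localhost -U postgres -d ocn -c 'SELECT count(*) AS evidences FROM evidence;'
ubuntu@damast $ psql -h localhost -U postgres -d ocn -c 'SELECT count(*) AS places FROM place;'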
The last step is to ensure that all subsequent reports reference the correct version of the data in DaRUS.
The SQLite3 report database contains a table database_version
for this purpose.
This table has an incrementing version number, a date, the URL associated with the dataset version, and a description.
Note that the version number in the table is just an integer and does not directly relate to the DaRUS dataset versioning scheme.
To add the new version, execute the SQL statement below in the report database.
WITH last_version AS (
SELECT version
FROM database_version
ORDER BY version DESC
LIMIT 1
)
INSERT INTO database_version (
version,
date,
url,
description
) VALUES (
(SELECT version+1 FROM last_version),
date('now'),
-- set correct version (URL) here:
'https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/darus-2318&version=1.0',
-- put correct update description here (what changed since the last version?):
'...'
);
Be sure to put the correct URL (to the new version) and description in the statement. The description should explain the purpose of the new version: What changed since the last one? Which special criteria or events made the new version necessary? Examples for good descriptions:
- "Initial DaRUS export, including only DhiMu data"
- "Added historical evidence from source X"
- "Included evidence for religions X and Y in region Z"
- The database SQL script can now be removed from the host system:
ubuntu@damast $ rm /tmp/database.sql
- On the system where the export script was run, the cached database dump should be removed to ensure that the export script uses a fresh dump next time:
$ rm prod.sql
- Finally, the Damast server needs to be started again on the host system:
ubuntu@damast $ sudo systemctl start damast
Afterwards, check whether the web pages load correctly, the data is shown in the visualization, and report generation works as expected.
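A minimal smoke test from the command line is to check that the start page is served again; the URL is a placeholder for the actual instance, and the visualization and report generation should still be checked in the browser:
# the base URL is a placeholder for the actual public Damast instance
$ curl -sSf -o /dev/null -w '%{http_code}\n' https://damast.example.org/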