Creating a new Database Export to DaRUS and the Public Damast Instance
These are instructions for creating a new data export from the live database. This export is then published on DaRUS, the data repository of the University of Stuttgart. It also serves as the data basis for the public version of Damast at the HU Berlin.
The data export does not contain all data in the database. The contents of some tables are excluded entirely, and additional criteria determine which of the remaining data is included. Because of the complex nature of the data schema, the export works as follows:
- First, all data is exported from the database via a full dump.
- Then, the data that should not be included is removed.
- Further, some data is altered; for example, the Shi'ite subgroups are removed, and all their evidences are filed under the general Shi'ite group instead.
Note: The instructions here contain shell commands that need to be run.
Some of these are run on your own computer, some on the host of the Damast system that will show the DaRUS data.
To differentiate, the command line prompt on your own host does not show a user and host name.
The Damast host will be called damast, and the user ubuntu (a probable candidate for a VPS or IaaS-hosted Damast host).
The command line prompts then look as follows:
# local prompt:
$
# local prompt with a Python virtualenv:
(env) $
# Damast host prompt
ubuntu@damast $
The full procedure can be run automatically using a script file, which is included below. The script makes some assumptions and needs to be modified for the correct host name and passwords:
- The host running the live PostgreSQL database to dump from must be set in line 12 of the export script. For the purpose of this script, it is assumed that the PostgreSQL server listens on port 5432 on that host. SSH is used to create a local port forwarding to connect to the database.
- The password for the postgres user in that database must be set in line 13.
- Docker must be installed on the system where the script is run. The user running the script also needs to be in the /etc/sudoers file.
- The Python script that creates the final JSON file requires a Python virtualenv with the postgres package installed to be available under env/ in the script directory. That can be ensured as follows:
  $ python -m venv env
  $ source env/bin/activate
  (env) $ pip install postgres
- The four files described in the following subsections need to be present, the last of them being the script itself.
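The preconditions above can be checked quickly before starting. The following commands are only a sketch; they assume the file names used in this documentation and a POSIX shell:
$ ls cleanup.sql to-json.py create_dump.sh init.d/init-user-db.sh
$ command -v docker pg_dump psql pg_isready ssh zip
$ sudo -v                              # succeeds only if the user may run sudo commands
(env) $ python -c 'import postgres'    # succeeds only inside the prepared virtualenv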
Once all files are present and all preconditions are met, the database export can be initiated as follows:
$ ./create_dump.sh
This will probably ask for the user's password to start sudo commands, and for the password or private key passphrase for connecting via SSH to the database host.
The script caches the initial database dump locally in prod.sql.
To ensure that subsequent exports actually use the current database contents, be sure to remove that file after a successful export.
The following SQL file should be called cleanup.sql so that the export script can find it.
-- This assumes a clean database export
-- remove user log stuff
DELETE FROM user_action;
DELETE FROM users;
-- drop annotation suggestion stuff
DELETE FROM annotation_suggestion;
DELETE FROM annotation_suggestion_document_state;
-- delete all persons and related data
UPDATE evidence SET person_instance_id = NULL;
DELETE FROM person_instance;
DELETE FROM person;
-- remove all evidence without the tag 'DhiMu'
WITH tag_ids AS (
SELECT id
FROM tag
WHERE tagname = 'DhiMu'
),
evidence_ids AS (
SELECT evidence_id AS id
FROM tag_evidence
WHERE tag_id IN (SELECT id FROM tag_ids)
)
DELETE FROM evidence
WHERE id NOT IN (SELECT id FROM evidence_ids);
-- remove tags 'DhiMu', 'eOC', 'Annotator beta test', 'Non-residential', 'Community'
DELETE
FROM tag
WHERE tagname IN ('DhiMu', 'Annotator beta test', 'eOC', 'Non-residential', 'Community');
-- delete evidence not from source list
WITH source_ids AS (
SELECT id
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
)
),
evidence_ids AS (
SELECT DISTINCT evidence_id
FROM source_instance
WHERE source_id IN (SELECT id FROM source_ids)
)
DELETE
FROM evidence E
WHERE E.id NOT IN (SELECT evidence_id FROM evidence_ids);
-- clean up orphaned instances and annotations
-- This needs to happen before removing the sources to clean up orphaned annotations -> document -> source
WITH E AS (
SELECT time_group_id AS tgid
FROM evidence
WHERE time_group_id IS NOT NULL
)
DELETE FROM time_group TG
WHERE TG.id NOT IN (SELECT tgid FROM E);
WITH E AS (
SELECT place_instance_id AS piid
FROM evidence
WHERE place_instance_id IS NOT NULL
)
DELETE FROM place_instance PI
WHERE PI.id NOT IN (SELECT piid FROM E);
WITH E AS (
SELECT person_instance_id AS piid
FROM evidence
WHERE person_instance_id IS NOT NULL
)
DELETE FROM person_instance PI
WHERE PI.id NOT IN (SELECT piid FROM E);
WITH E AS (
SELECT religion_instance_id AS riid
FROM evidence
WHERE religion_instance_id IS NOT NULL
)
DELETE FROM religion_instance RI
WHERE RI.id NOT IN (SELECT riid FROM E);
WITH P AS (
SELECT DISTINCT annotation_id AS aid
FROM place_instance
WHERE annotation_id IS NOT NULL
),
PE AS (
SELECT DISTINCT annotation_id AS aid
FROM person_instance
WHERE annotation_id IS NOT NULL
),
R AS (
SELECT DISTINCT annotation_id AS aid
FROM religion_instance
WHERE annotation_id IS NOT NULL
),
T AS (
SELECT DISTINCT annotation_id AS aid
FROM time_group
WHERE annotation_id IS NOT NULL
)
DELETE FROM annotation A
WHERE A.id NOT IN (SELECT aid FROM P)
AND A.id NOT IN (SELECT aid FROM PE)
AND A.id NOT IN (SELECT aid FROM R)
AND A.id NOT IN (SELECT aid FROM T);
-- delete documents not from source list
WITH source_ids AS (
SELECT id
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
)
)
DELETE
FROM document
WHERE source_id NOT IN (SELECT id FROM source_ids);
-- delete sources not in source list (source_instance tuples are already cascade-deleted)
DELETE
FROM source
-- https://github.com/UniStuttgart-VISUS/damast/issues/3#issuecomment-1063084272
WHERE id NOT IN (
1, 2, 3, 15, 13, 26, 5, 16, 42, 25, 19, 29, 30, 20, 4, 11, 10, 14, 63, 17,
40, 41, 34, 35, 8, 58, 57, 39, 37, 38, 43, 68, 69
);
-- delete places not used in evidences
WITH place_ids AS (
SELECT place_id AS id
FROM place_instance
)
DELETE
FROM place
WHERE id NOT IN (SELECT id FROM place_ids);
-- update religions: merge all Shiites under the main Shia religion, update
-- comments of religion instances to reflect where the entries came from, drop
-- sub-Shiite religions
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Zayidiya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM five);
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Isma''iliya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM seven);
WITH
five AS ( SELECT id FROM religion WHERE abbreviation = '5S' ),
seven AS ( SELECT id FROM religion WHERE abbreviation = '7S' ),
twelve AS ( SELECT id FROM religion WHERE abbreviation = '12S' ),
shia AS ( SELECT id FROM religion WHERE abbreviation = 'SHIA' )
UPDATE religion_instance
SET religion_id = (SELECT id FROM shia LIMIT 1),
comment = 'Imamiya' || (SELECT CASE
WHEN comment IS NULL OR comment = '' THEN ''
ELSE '; ' || comment
END)
WHERE religion_id IN (SELECT id FROM twelve);
WITH
rids AS ( SELECT id FROM religion WHERE abbreviation IN ('5S', '7S', '12S' ))
DELETE FROM religion
WHERE id IN ( SELECT id FROM rids );
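To inspect the effect of the cleanup script manually, the queries below can be run against the local Docker database before it is stopped, for example by running the export steps by hand or by temporarily removing the final docker stop from the export script. The queries are only a sketch; the table names are those used in the cleanup script above:
$ PGPASSWORD=postgres psql -h localhost -U postgres -d ocn <<'EOSQL'
-- remaining evidence tuples (only whitelisted sources should be left)
SELECT count(*) AS evidences FROM evidence;
-- persons and user log entries should be gone
SELECT count(*) AS persons FROM person;
SELECT count(*) AS user_actions FROM user_action;
-- the Shi'ite subgroups should no longer exist
SELECT count(*) AS sub_shia FROM religion WHERE abbreviation IN ('5S', '7S', '12S');
EOSQL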
The following Python script should be called to-json.py so that the export script can find it.
#!/usr/bin/env python3
import json
import os
from postgres import Postgres
from postgres.cursors import SimpleDictCursor
from psycopg2.extras import NumericRange
jsonfile = 'database.json'
ignore_tables = [
'action_type',
'annotation_suggestion_document_state',
'place_time_range',
'bishopric_type',
'bishopric_place',
'bishopric',
'place_set',
'user_action',
'advanced_religion_filter',
'religion_filter_group_set',
'religion_filter_group',
'spatial_ref_sys',
'bishopric_residence',
'person_bishopric',
'source_bishopric',
'annotation_suggestion',
'external_database',
]
class _Encoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, NumericRange):
            if o.isempty:
                return None
            lower = None if o.lower_inf else o.lower if o.lower_inc else o.lower + 1
            upper = None if o.upper_inf else o.upper if o.upper_inc else o.upper - 1
            return dict(type='int4range', lower=lower, upper=upper)
        if isinstance(o, memoryview):
            return bytes(o).decode('utf-8')
        return json.JSONEncoder.default(self, o)

conn = Postgres('user=postgres password=postgres dbname=ocn host=localhost', cursor_factory=SimpleDictCursor)

with conn.get_cursor() as cur:
    cur.run('''SELECT table_name
        FROM information_schema.tables
        WHERE table_schema = %(schema)s
          AND table_type = %(type)s
          AND NOT (table_name = ANY(%(ignore)s));''',
        schema = 'public',
        type = 'BASE TABLE',
        ignore = ignore_tables)

    data = dict()
    tables = list(map(lambda t: t['table_name'], cur.fetchall()))
    for table in tables:
        vs = cur.all(F'SELECT * FROM {table} ORDER BY id;')
        data[table] = vs

    with open(jsonfile, 'w') as f:
        json.dump(data, f, indent=2, cls=_Encoder)
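Once database.json has been written, the list of exported tables and their row counts can be inspected to confirm the export looks plausible. This is only a sketch that reads the JSON file produced above:
$ python3 - <<'EOF'
import json

# load the JSON dump created by to-json.py
with open('database.json') as f:
    data = json.load(f)

# print each exported table with its number of rows
for table in sorted(data):
    print(f'{table}: {len(data[table])} rows')
EOF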
A directory init.d/ should exist in the directory the script is executed in.
In it, place a script file, named for example init-user-db.sh, with the following contents:
#!/bin/bash
set -e
psql -v ON_ERROR_STOP=1 --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" <<-EOSQL
CREATE USER users;
CREATE USER api;
CREATE USER ro_dump;
EOSQL
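These roles are presumably referenced by ownership and GRANT statements in the production dump, so creating them up front avoids errors while the dump is loaded. Whether the init script ran can be verified while the container is up (a sketch; the container listens on localhost because of --net=host):
$ PGPASSWORD=postgres psql -h localhost -U postgres -c '\du'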
Finally, the export script itself. It can have any name, but is called create_dump.sh in this documentation.
#!/bin/bash
set -euo pipefail
proddump=prod.sql
dockername=local-damast-db
finaldump=database.sql
schemadump=schema.sql
datazip=data.zip
# XXX: set the correct host and postgres-password to dump from
databasehost=dummy.host-name.org
databasepassword="foo bar"
log() {
printf "\n\033[1;34m%s\033[0m\n" "$1"
}
# dump production DB if no dump present
if [ ! -f $proddump ]
then
set +e
ssh -NfL 15432:localhost:5432 $databasehost
set -e
log "Dumping production database to $proddump ..."
PGPASSWORD="$databasepassword" \
pg_dump \
-U postgres \
-d ocn \
-h localhost \
-p 15432 \
--clean \
--create \
--if-exists \
--no-password \
| sed "s/LOCALE = 'en_GB.UTF-8'/LC_COLLATE = 'en_US.UTF-8'/" \
| sed "s/LC_CTYPE = 'en_GB.UTF-8'/LC_CTYPE = 'en_US.UTF-8'/" \
| grep -v "REVOKE .* \"27946\"" \
> $proddump
ls -l $proddump
fi
# set up database
sudo docker stop $dockername 2>/dev/null || true
log "Starting PostGIS database container ..."
sudo docker run --rm \
--detach \
--name "$dockername" \
--net=host \
-e POSTGRES_PASSWORD=postgres \
-v $(pwd)/init.d/:/docker-entrypoint-initdb.d:z \
postgis/postgis:10-3.1
# wait for the local server to start
while ! pg_isready -h localhost; do sleep 1; done
log "Loading database dump into PostGIS database ..."
PGPASSWORD=postgres psql \
-h localhost \
-U postgres \
--no-password \
-f $proddump
log "Dropping unneeded data ..."
PGPASSWORD=postgres psql \
-h localhost \
-U postgres \
-d ocn \
--no-password \
-f cleanup.sql
log "Creating final dump ..."
PGPASSWORD=postgres pg_dump \
-U postgres \
-d ocn \
-h localhost \
--clean \
--create \
--if-exists \
--no-password \
> $finaldump
log "Creating schema SQL file ..."
PGPASSWORD=postgres pg_dump \
-U postgres \
-d ocn \
-h localhost \
--create \
--schema-only \
> $schemadump
log "Creating JSON dump ..."
(
source env/bin/activate
python to-json.py
)
log "Terminating PostGIS database container ..."
sudo docker stop $dockername
log "Creating data ZIP file ..."
rm -f $datazip
zip -r $datazip database.json schema.sql database.sql
The export script should now have produced three files of relevance:
- database.sql: the exported data as an SQL file.
- schema.sql: the database schema without the data.
- database.json: the exported data in JSON format.
The three files are also contained in the ZIP file data.zip.
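Before uploading, it is worth checking that the three files are non-empty and that the ZIP archive contains exactly these files:
$ ls -lh database.sql schema.sql database.json
$ unzip -l data.zip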
To upload the new version to DaRUS, simply replace the three files of the same name in the DaRUS dataset after clicking Edit dataset.
Afterwards, edit the metadata appropriately, update the links to the files in the description (the fileIds should have changed), and then submit the new version for review.
Connect to the Damast instance where the DaRUS data should be visible.
- As a precaution, make a backup of important data (configuration, report database, and user database); see the sketch after this list.
- Copy the database.sql file created by the export script onto the Damast host system. For our purposes, the file is then located on the host system under /tmp/database.sql.
- Then, stop the Damast server:
ubuntu@damast $ sudo systemctl stop damast
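The backup mentioned in the first step can be as simple as copying the relevant files into a dated directory. The paths below are placeholders; use the actual locations of the configuration, report database, and user database on your host:
# all paths are placeholders and depend on the deployment
ubuntu@damast $ mkdir -p ~/damast-backup-$(date +%F)
ubuntu@damast $ cp /path/to/damast.conf /path/to/reports.db /path/to/users.db ~/damast-backup-$(date +%F)/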
The database.sql file should contain all instructions needed to replace the current database contents.
To apply it, run the following command on the host system:
ubuntu@damast $ psql -h localhost -U postgres -d postgres -f /tmp/database.sql
This command should ask for the database password of the postgres user and then start applying the database contents.
This will take a few minutes.
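A quick plausibility check after the import is to count a few central tables (a sketch; the actual numbers depend on the export):
ubuntu@damast $ psql -h localhost -U postgres -d ocn -c 'SELECT count(*) AS evidences FROM evidence;'
ubuntu@damast $ psql -h localhost -U postgres -d ocn -c 'SELECT count(*) AS places FROM place;'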
The last step is to ensure that all subsequent reports reference the correct version of the data in DaRUS.
The SQLite3 report database contains a table database_version
for this purpose.
This table has an incrementing version number, a date, the URL associated with the dataset version, and a description.
Note that the version number in the table is just an integer and does not directly relate to the DaRUS dataset versioning scheme.
To add the new version, execute the SQL statement below in the report database.
WITH last_version AS (
SELECT version
FROM database_version
ORDER BY version DESC
LIMIT 1
)
INSERT INTO database_version (
version,
date,
url,
description
) VALUES (
(SELECT version+1 FROM last_version),
date('now'),
-- set correct version (URL) here:
'https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/darus-2318&version=1.0',
-- put correct update description here (what changed since the last version?):
'...'
);
Be sure to put the correct URL (to the new version) and description in the statement. The description should explain the purpose of the new version: What changed since the last one? Which special criteria or events made the new version necessary? Examples for good descriptions:
- "Initial DaRUS export, including only DhiMu data"
- "Added historical evidence from source X"
- "Included evidence for religions X and Y in region Z"
- The database SQL script can now be removed from the host system:
ubuntu@damast $ rm /tmp/database.sql
- On the system where the export script was run, the cached database dump should be removed to ensure that the export script uses a fresh dump next time:
$ rm prod.sql
- Finally, the Damast server needs to be started again on the host system:
ubuntu@damast $ sudo systemctl start damast
Afterwards, check whether the web pages load correctly, the data is shown in the visualization, and report generation works as expected.
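A minimal smoke test from the command line is to check that the start page is served again; the URL is a placeholder for the actual instance, and the visualization and report generation should still be checked in the browser:
# the base URL is a placeholder for the actual public Damast instance
$ curl -sSf -o /dev/null -w '%{http_code}\n' https://damast.example.org/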