QGOV changes, #4 #11

Open
wants to merge 25 commits into
base: master
71f6787
[QOL-9122] general cleanup
ThrawnCA Jul 22, 2022
a8ebeb4
[QOL-9122] replace six.BytesIO and six.StringIO with io.BytesIO
ThrawnCA Jul 22, 2022
e2883d8
[QOL-9122] introduce 'ckan_cli' to handle both Paster and Click
ThrawnCA Jul 22, 2022
3ef8a33
[QOL-9122] pin version of ckanext-scheming
ThrawnCA Jul 22, 2022
870a96f
[QOL-9122] add Flake8 rules and make them pass
ThrawnCA Jul 22, 2022
0844e28
[QOL-9122] use Flake8 config in CI instead of manually specifying
ThrawnCA Jul 22, 2022
5cf0557
[QOL-9122] move plugin module to higher level
ThrawnCA Jul 22, 2022
43b1fa7
[QOL-9122] use cgi-based mock file storage for testing on CKAN < 2.9
ThrawnCA Jul 22, 2022
dfc5683
[QOL-9122] ensure URL package IDs match resource IDs
ThrawnCA Jul 22, 2022
2af8bc9
[QOL-9122] mock core enqueue function instead of ours
ThrawnCA Jul 22, 2022
08e07fe
[QOL-9122] allow job consumer to accept resource IDs
ThrawnCA Jul 22, 2022
c556ffd
[QOL-9122] update README
ThrawnCA Jul 22, 2022
cfa3d15
[QOL-9122] skip validating untouched resources in a package
ThrawnCA Jul 22, 2022
e353f9e
[QOL-9122] retrieve package ID if not provided on resource update
ThrawnCA Jul 22, 2022
661a371
[QOL-9122] shrink validation jobs
ThrawnCA Jul 22, 2022
1bb3617
[QOL-9122] add title to the validation link
ThrawnCA Jul 25, 2022
1f61260
[QOL-9122] use variable to manage ckan_cli path in one place
ThrawnCA Jul 25, 2022
6f1709c
[QOL-9122] use ValidationStatusHelper to simplify database access
ThrawnCA Jul 25, 2022
4c977db
[QOL-9122] mark broken tests to be skipped instead of commenting them…
ThrawnCA Jul 25, 2022
aa188bc
[QOL-9122] add unit test for attempting to run validation without suf…
ThrawnCA Jul 25, 2022
b85d8dd
[QOL-9122] add template helper to reliably detect CKAN 2.9+
ThrawnCA Jul 25, 2022
79a1668
[QOL-9122] add test for avoiding unnecessary validation
ThrawnCA Jul 25, 2022
6bdf4ab
[QOL-9122] make custom actions more robust
ThrawnCA Jul 25, 2022
013c17c
[QOL-9122] skip validation on package_patch API
ThrawnCA Jul 25, 2022
fcf353f
[QOL-9122] drop TODO that we don't need to implement
ThrawnCA Jul 25, 2022
22 changes: 22 additions & 0 deletions .flake8
@@ -0,0 +1,22 @@
[flake8]
# @see https://flake8.pycqa.org/en/latest/user/configuration.html?highlight=.flake8

exclude =
ckan
scripts

# Extended output format.
format = pylint

# Show the source of errors.
show_source = True
statistics = True

max-complexity = 10
max-line-length = 127

# List ignore rules one per line.
ignore =
E501
C901
W503
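
The `ignore` list interacts with the two limits set above it: E501 is the line-length check and C901 is the complexity check, so with both ignored the `max-line-length` and `max-complexity` values are not expected to produce any reports. A minimal sketch of that interaction, assuming the file is read as a standard INI file with indented continuation lines (the indentation is flattened in this view); `parse_ignore` and `inert_limits` are hypothetical helper names:

```python
import configparser

# Abridged copy of the .flake8 file from this diff, with the continuation
# lines indented as an INI parser requires (assumption: the real file is
# indented this way; the diff view has lost the whitespace).
FLAKE8_CONFIG = """\
[flake8]
max-complexity = 10
max-line-length = 127

# List ignore rules one per line.
ignore =
    E501
    C901
    W503
"""

# Which limit option feeds which error code.
LIMIT_CODES = {"max-line-length": "E501", "max-complexity": "C901"}


def parse_ignore(text):
    """Return the set of ignored error codes from a flake8 INI section."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    return set(parser.get("flake8", "ignore").split())


def inert_limits(text):
    """List limit options whose error code is ignored outright."""
    ignored = parse_ignore(text)
    return sorted(opt for opt, code in LIMIT_CODES.items() if code in ignored)


if __name__ == "__main__":
    print(sorted(parse_ignore(FLAKE8_CONFIG)))
    print(inert_limits(FLAKE8_CONFIG))
```

If the limits are meant to be enforced, dropping E501 and C901 from `ignore` would be the change to consider.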
19 changes: 8 additions & 11 deletions .github/workflows/ci.yml
@@ -11,7 +11,7 @@ jobs:
- name: Install requirements
run: pip install flake8 pycodestyle
- name: Check syntax
run: flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics --exclude ckan
run: flake8

test:
needs: lint
@@ -52,16 +52,13 @@ jobs:
pip install -e .
# Replace default path to CKAN core config file with the one on the container
sed -i -e 's/use = config:.*/use = config:\/srv\/app\/src\/ckan\/test-core.ini/' test.ini
- name: Setup extension (CKAN >= 2.9)
if: ${{ matrix.ckan-version != '2.7' && matrix.ckan-version != '2.8' }}
- name: Setup extension
run: |
ckan -c test.ini db init
ckan -c test.ini validation init-db
- name: Setup extension (CKAN < 2.9)
if: ${{ matrix.ckan-version == '2.7' || matrix.ckan-version == '2.8' }}
run: |
paster --plugin=ckan db init -c test.ini
paster --plugin=ckanext-validation validation init-db -c test.ini
CKAN_CLI=bin/ckan_cli
chmod u+x $CKAN_CLI
export CKAN_INI=test.ini
$CKAN_CLI db init
PASTER_PLUGIN=ckanext-validation $CKAN_CLI validation init-db
- name: Run tests
run: pytest --ckan-ini=test.ini --cov=ckanext.validation --cov-report=xml --cov-append --disable-warnings ckanext/validation/tests
- name: Coveralls
@@ -77,4 +74,4 @@ jobs:
- name: Coveralls Finished
uses: AndreMiras/coveralls-python-action@develop
with:
parallel-finished: true
parallel-finished: true
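
The `bin/ckan_cli` script itself is not part of the excerpt shown here. Based only on the two command forms this workflow change replaces, a wrapper of this kind has to translate one invocation into either the Click or the Paster form. A hypothetical Python sketch of that dispatch follows; the function name, the `use_click` flag and the default plugin `ckan` are assumptions, not the script's actual implementation:

```python
def build_cli_command(args, env, use_click):
    """Translate a generic CLI call into a Click or Paster command line.

    args      -- subcommand and its arguments, e.g. ["db", "init"]
    env       -- mapping providing CKAN_INI and, optionally, PASTER_PLUGIN
    use_click -- True on CKAN >= 2.9 (``ckan`` command), False otherwise
    """
    ini = env["CKAN_INI"]
    if use_click:
        # Click form, as in the removed step: ckan -c test.ini db init
        return ["ckan", "-c", ini] + list(args)
    # Paster form: paster --plugin=<plugin> <command> -c test.ini
    plugin = env.get("PASTER_PLUGIN", "ckan")
    return ["paster", "--plugin=" + plugin] + list(args) + ["-c", ini]


if __name__ == "__main__":
    env = {"CKAN_INI": "test.ini", "PASTER_PLUGIN": "ckanext-validation"}
    print(" ".join(build_cli_command(["validation", "init-db"], env, True)))
    print(" ".join(build_cli_command(["validation", "init-db"], env, False)))
```

Keeping the translation in one place is what lets the workflow drop the two version-specific setup steps in favour of a single one.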
3 changes: 2 additions & 1 deletion .gitignore
@@ -14,7 +14,6 @@ develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
@@ -24,6 +23,8 @@ wheels/
*.egg-info/
.installed.cfg
*.egg
*.eggs
src/

# PyInstaller
# Usually these files are written by a python script from a template
59 changes: 34 additions & 25 deletions README.md
@@ -36,13 +36,13 @@ Data description and validation for CKAN with [Frictionless Data](https://fricti

## Overview

This extension brings data validation powered by the [goodtables](https://github.com/frictionlessdata/goodtables-py) library to CKAN. It provides out of the box features to validate tabular data and integrate validation reports to the CKAN interface.
This extension brings data validation powered by the [Goodtables](https://github.com/frictionlessdata/goodtables-py) library to CKAN. It provides out-of-the-box features to validate tabular data and integrate validation reports to the CKAN interface.

Data validation can be performed automatically on the background or during dataset creation, and the results are stored against each resource.

!['Status badges in resources'](https://i.imgur.com/9VIzfwo.png)

Comprehensive reports are created describing issues found with the data, both at the structure level (missing headers, blank rows, etc) and at the data schema level (wrong data types, values out of range etc).
Comprehensive reports are created describing issues found with the data, both at the structure level (missing headers, blank rows, etc) and at the data schema level (wrong data types, values out of range, etc).


The extension also exposes all the underlying [actions](#action-functions) so data validation can be integrated in custom workflows from other extensions.
@@ -62,12 +62,18 @@ If you want to use [asynchronous validation](#asynchronous-validation) with back

To install ckanext-validation, activate your CKAN virtualenv and run:

git clone https://github.com/frictionlessdata/ckanext-validation.git
git clone https://github.com/keitaroinc/ckanext-validation.git
cd ckanext-validation
pip install -r requirements.txt
python setup.py develop

Create the database tables running:
Or:

pip install -e 'git+https://github.com/keitaroinc/ckanext-validation.git#egg=ckanext-validation'
cd ckanext-validation
pip install -r requirements.txt

Create the database tables by running:

ON CKAN >= 2.9:

@@ -80,23 +86,26 @@ ON CKAN <= 2.8:

## Configuration

Once installed, add the `validation` plugin to the `ckan.plugins` configuration option on your INI file:
Once installed, add the `validation` plugin to the `ckan.plugins` configuration option in your INI file:

ckan.plugins = ... validation

*Note:* if using CKAN 2.6 or lower and the [asynchronous validation](#asynchronous-validation) also add the `rq` plugin ([see Versions supported and requirements](#versions-supported-and-requirements)) to `ckan.plugins`.
*Note:* if using CKAN 2.6 or lower and [asynchronous validation](#asynchronous-validation), also add the `rq` plugin ([see Versions supported and requirements](#versions-supported-and-requirements)) to `ckan.plugins`.

### Adding schema fields to the Resource metadata

The extension requires changes in the CKAN metadata schema. The easisest way to add those is by using ckanext-scheming. Use these two configuration options to link to the dataset schema (replace with your own if you need to customize it) and the required presets:
The extension requires changes in the CKAN metadata schema. The easiest way to add those is by using ckanext-scheming. Use these two configuration options to link to the dataset schema (replace with your own if you need to customize it) and the required presets:

scheming.dataset_schemas = ckanext.validation.examples:ckan_default_schema.json
scheming.presets = ckanext.scheming:presets.json
ckanext.validation:presets.json

Read more below about to [change the CKAN metadata schema](#changes-in-the-metadata-schema)
Read more below about how to [change the CKAN metadata schema](#changes-in-the-metadata-schema)

### Operation modes
Use the following to configure which queue async jobs are added to

ckanext.validation.queue = bulk (Defaults to default)

Use the following configuration options to choose the [operation modes](#operation-modes):

@@ -108,7 +117,7 @@ Use the following configuration options to choose the [operation modes](#operati

### Formats to validate

By default validation will be run agaisnt the following formats: `CSV`, `XLSX` and `XLS`. You can modify these formats using the following option:
By default validation will be run against the following formats: `CSV`, `XLSX` and `XLS`. You can modify these formats using the following option:

ckanext.validation.formats = csv xlsx
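
A minimal sketch of how a space-separated option like this could be read, assuming a plain dict stands in for the CKAN config object; `get_supported_formats` and `should_validate` are hypothetical helpers, not the extension's actual code:

```python
DEFAULT_FORMATS = ["csv", "xlsx", "xls"]  # defaults named in the README text


def get_supported_formats(config):
    """Return the lower-cased list of formats that should be validated."""
    raw = config.get("ckanext.validation.formats", "")
    formats = [fmt.lower() for fmt in raw.split()]
    # Fall back to the documented defaults when the option is unset or empty.
    return formats or DEFAULT_FORMATS


def should_validate(resource, config):
    """True when the resource's format is one of the configured formats."""
    fmt = (resource.get("format") or "").lower()
    return fmt in get_supported_formats(config)


if __name__ == "__main__":
    config = {"ckanext.validation.formats": "csv xlsx"}
    print(should_validate({"format": "XLSX"}, config))
    print(should_validate({"format": "XLS"}, config))
```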

@@ -120,7 +129,7 @@ You can also provide [validation options](#validation-options) that will be used

Make sure to use indentation if the value spans multiple lines otherwise it won't be parsed.

If you are using a cloud-based storage backend for uploads check [Private datasets](#private-datasets) for other configuration settings that might be relevant.
If you are using a cloud-based storage backend for uploads, check [Private datasets](#private-datasets) for other configuration settings that might be relevant.

### Display badges

@@ -133,13 +142,13 @@ To prevent the extension from adding the validation badges next to the resources

### Data Validation

CKAN users will be familiar with the validation performed against the metadata fields when creating or updating datasets. The form will return an error for instance if a field is missing or it doesn't have the expected format.
CKAN users will be familiar with the validation performed against the metadata fields when creating or updating datasets. The form will return an error, for instance, if a field is missing or it doesn't have the expected format.

Data validation follows the same principle but against the actual data published in CKAN, that is the contents of tabular files (Excel, CSV, etc) hosted in CKAN itself or elsewhere. Whenever a resource of the appropiate format is created or updated, the extension will validate the data against a collection of checks. This validation is powered by [goodtables](https://github.com/frictionlessdata/goodtables-py), a very powerful data validation library developed by [Open Knowledge International](https://okfn.org) as part of the [Frictionless Data](https://frictionlessdata.io) project. Goodtables provides an extensive suite of [checks](https://github.com/frictionlessdata/goodtables-py#checks) that cover common issues with tabular data files.
Data validation follows the same principle, but against the actual data published in CKAN, that is the contents of tabular files (Excel, CSV, etc) hosted in CKAN itself or elsewhere. Whenever a resource of the appropriate format is created or updated, the extension will validate the data against a collection of checks. This validation is powered by [Goodtables](https://github.com/frictionlessdata/goodtables-py), a very powerful data validation library developed by [Open Knowledge International](https://okfn.org) as part of the [Frictionless Data](https://frictionlessdata.io) project. Goodtables provides an extensive suite of [checks](https://github.com/frictionlessdata/goodtables-py#checks) that cover common issues with tabular data files.

These checks include structural problems like missing headers or values, blank rows, etc., but also can validate the data contents themselves (see [Data Schemas](#data-schemas)) or even run [custom checks](https://github.com/frictionlessdata/goodtables-py#custom-constraint).

The result of this validation is a JSON report. This report contains all the issues found (if any) with their relevant context (row number, columns involved, etc). The reports are stored in the database and linked to the CKAN resource, and can be retrieved [via the API](#resource_validation_show).
The result of this validation is a JSON report. This report contains all the issues found (if any) with their relevant context (row number, columns involved, etc). The reports are stored in the database and linked to the CKAN resources, and can be retrieved [via the API](#resource_validation_show).
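
A sketch of fetching a stored report over HTTP, assuming the standard CKAN action API path `/api/3/action/<action>` and that the action takes a `resource_id` parameter (an assumption here, since the parameter list is not shown in this excerpt). The site URL and resource id in the usage lines are placeholders:

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_report_request(site_url, resource_id, api_key=None):
    """Build a GET request for the resource_validation_show action."""
    query = urlencode({"resource_id": resource_id})
    url = "{}/api/3/action/resource_validation_show?{}".format(
        site_url.rstrip("/"), query
    )
    # An API key is only needed for resources the caller cannot read anonymously.
    headers = {"Authorization": api_key} if api_key else {}
    return Request(url, headers=headers)


def fetch_report(site_url, resource_id, api_key=None):
    """Call the action and return the decoded ``result`` payload."""
    request = build_report_request(site_url, resource_id, api_key)
    with urlopen(request) as response:
        return json.load(response)["result"]


if __name__ == "__main__":
    # Placeholder values; prints the URL without performing a network call.
    print(build_report_request("https://ckan.example.org", "abc-123").full_url)
```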

If there is a report available for a particular resource, a status badge will be displayed in the resource listing and on the resource page, showing whether validation passed or failed for the resource.

@@ -149,11 +158,11 @@ Clicking on the badge will take you to the validation report page, where the rep

!['Validation report'](https://i.imgur.com/Mm6vKFD.png)

Whenever possible, the report will provide a preview of the cells, rows or columns involved in an error, to make easy to identify and fix it.
Whenever possible, the report will provide a preview of the cells, rows or columns involved in an error, to make it easy to identify and fix it.

### Data Schema

As we mentioned before, data can be validated against a schema. Much in the same way that the standard CKAN schema for metadata fields, the schema describes the data and what its values are expected to be.
As mentioned before, data can be validated against a schema. Much in the same way as the standard CKAN schema for metadata fields, the schema describes the data and what its values are expected to be.

These schemas are defined following the [Table Schema](http://frictionlessdata.io/specs/table-schema/) specification, a really simple and flexible standard for describing tabular data.

@@ -211,7 +220,7 @@ The following schema describes the expected data:

```

If we store this schema agaisnt a resource, it will be used to perform a more thorough validation. For instance, updating the resource with the following data would fail validation with a variety of errors, even if the general structure of the file is correct:
If we store this schema against a resource, it will be used to perform a more thorough validation. For instance, updating the resource with the following data would fail validation with a variety of errors, even if the general structure of the file is correct:


| id | location | date | measurement | observations |
@@ -225,7 +234,7 @@ With the extension enabled and configured, schemas can be attached to the `schem

### Validation Options

As we saw before, the validation process involves many different checks and it's very likely that what "valid" data actually means will vary across CKAN instances or datasets. The validation process can be tweaked by passing any of the [supported options](https://github.com/frictionlessdata/goodtables-py#validatesource-options) on goodtables. These can be used to add or remove specific checks, control limits, etc.
As we saw before, the validation process involves many different checks and it's very likely that what "valid" data actually means will vary across CKAN instances or datasets. The validation process can be tweaked by passing any of the [supported options](https://github.com/frictionlessdata/goodtables-py#validatesource-options) to Goodtables. These can be used to add or remove specific checks, control limits, etc.

For instance, the following file would fail validation using the default options, but it may be valid in a given context, or the issues may be known to the publishers:

@@ -263,7 +272,7 @@ Validation can be performed on private datasets. When validating a locally uploa

In these cases, the API key for the site user will be passed as part of the request (or alternatively `ckanext.validation.pass_auth_header_value` if set in the configuration).

As this involves sending API keys to other extensions this behaviour can be turned off by setting `ckanext.validation.pass_auth_header` to `False`.
As this involves sending API keys to other extensions, this behaviour can be turned off by setting `ckanext.validation.pass_auth_header` to `False`.

Again, these settings only affect private resources when using a cloud-based backend.
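
The header logic described above can be sketched as follows. This is a hypothetical helper written from the README wording only; the function name, the use of the `Authorization` header and the string handling of the `False` setting are assumptions:

```python
def build_auth_headers(config, site_user_api_key, is_private):
    """Return the extra request headers to send when fetching a resource."""
    # Only private resources need credentials.
    if not is_private:
        return {}
    # INI values arrive as strings, so compare case-insensitively.
    enabled = str(config.get("ckanext.validation.pass_auth_header", "True"))
    if enabled.strip().lower() in ("false", "0", "no", "off"):
        return {}
    # An explicit header value overrides the site user's API key.
    value = config.get(
        "ckanext.validation.pass_auth_header_value", site_user_api_key
    )
    return {"Authorization": value}


if __name__ == "__main__":
    print(build_auth_headers({}, "site-user-key", True))
    print(build_auth_headers(
        {"ckanext.validation.pass_auth_header": "False"}, "site-user-key", True
    ))
```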

@@ -275,7 +284,7 @@ The data validation process described above can be run in two modes: asynchronou

#### Asynchronous validation

Asynchronous validation is run in the background whenever a resource of a supported format is created or updated. Validation won't affect the action performed, so if there are validation errors found the reource will be created or updated anyway.
Asynchronous validation is run in the background whenever a resource of a supported format is created or updated. Validation won't affect the action performed, so if there are validation errors found the resource will be created or updated anyway.

This mode might be useful for instances where datasets are harvested from other sources, or where multiple publishers create datasets and as a maintainer you only want to give visibility to the quality of data, encouraging publishers to fix any issues.

@@ -341,7 +350,7 @@ The extension requires changes in the default CKAN resource metadata schema to a
Here's more detail on the fields added:

* `schema`: This can be a [Table Schema](http://frictionlessdata.io/specs/table-schema/) JSON object or an URL pointing to one. In the UI form you can upload a JSON file, link to one providing a URL or enter it directly. If uploaded, the file contents will be read and stored in the `schema` field. In all three cases the contents will be validated against the Table Schema specification.
* `validation_options`: A JSON object with validation options that will be passed to [goodtables](https://github.com/frictionlessdata/goodtables-py#validatesource-options).
* `validation_options`: A JSON object with validation options that will be passed to [Goodtables](https://github.com/frictionlessdata/goodtables-py#validatesource-options).

![Form fields](https://i.imgur.com/ixKOCij.png)

@@ -523,7 +532,7 @@ def resource_validation_run_batch(context, data_dict):

### Starting the validation process manually

You can start (asynchronous) validation jobs from the command line using the `ckan validation run` command. If no parameters are provided it will start a validation job for all resources in the site of suitable format (ie `ckanext.validation.formats`):
You can start (asynchronous) validation jobs from the command line using the `validation run` command. If no parameters are provided it will start a validation job for all resources in the site of suitable format (ie `ckanext.validation.formats`):

ON CKAN >= 2.9:

@@ -566,13 +575,13 @@ ON CKAN >= 2.9:

ON CKAN <= 2.8:

paster validation report -c /path/to/ckan/ini
paster validation report -c /path/to/ckan/ini

paster validation report-full -c /path/to/ckan/ini
paster validation report-full -c /path/to/ckan/ini


Both commands will print an overview of the total number of datasets and tabular resources, and a breakdown of how many have a validation status of success,
failure or error. Additionally they will create a CSV report. `ckan validation report` will create a report with all failing resources, including the following fields:
failure or error. Additionally they will create a CSV report. `validation report` will create a report with all failing resources, including the following fields:

* Dataset name
* Resource id
@@ -581,7 +590,7 @@ failure or error. Additionally they will create a CSV report. `ckan validation r
* Status
* Validation report URL

`ckan validation report-full` will add a row on the output CSV for each error found on the validation report (limited to ten occurrences of the same error type per file). So the fields in the generated CSV report will be:
`validation report-full` will add a row on the output CSV for each error found on the validation report (limited to ten occurrences of the same error type per file). So the fields in the generated CSV report will be:

* Dataset name
* Resource id