Minor version release updating dependencies.
- Default Python version used in automated tests is changed to 3.12
- Pandas updated to 2.2.2 version
- SQLAlchemy updated to 2.x API
- Numpy updated to 2.x API
- Experimental support for using SQL to generate anonymising sets of values. This feature is available for all column types except numerical.
- `make_distinct` custom action now works on date columns.
- You can now easily add a column with the current date and time by using `'@sysdate'` as a derived column.
- Pseudo-CHI numbers can now be generated by passing `pseudo_chi` as the anonymising set to UUID columns.
- Numerical column weights for categorical values are now optional. This should speed up the process of manually composing a specification.
- Minor bugs fixed in `shift_distribution`, `make_outlier` and `make_distinct`.
- Fixed a bug in regex distribution where the target number of uniques wasn't respected.
- Fixed date column not being recognized if source data had missing values.
- Python version changes to 3.10
- Pandas updated to 2.x version
- Using Exhibit as an importable library is now easier. Please see the scripting recipe for more details and examples.
- `anon.db` is now called `exhibit.db`. You can also now use 3rd party databases to store associated specification / aliasing data, as long as you have the required SQLAlchemy dialect installed. Set the `EXHIBIT_DB_SCHEMA` and `EXHIBIT_DB_URL` environment variables and Exhibit will use those instead of the local `exhibit.db`.
- New custom action & filter pairs: `shift_distribution_right / left` and `COLUMN_NAME with_high / low_frequency`
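
The environment-variable override for the database location can be pictured with a small sketch. Only the `EXHIBIT_DB_URL` and `EXHIBIT_DB_SCHEMA` names come from these notes; the helper name `resolve_db_settings` and the SQLite fallback URL are assumptions made for illustration and are not taken from Exhibit's own code.

```python
import os


def resolve_db_settings(env=None):
    """Return (url, schema) for the lookup database.

    Hypothetical helper: the fallback below is an assumed SQLAlchemy-style
    URL pointing at a local exhibit.db file.
    """
    env = os.environ if env is None else env
    # Prefer the user-supplied URL; otherwise fall back to the local file.
    url = env.get("EXHIBIT_DB_URL") or "sqlite:///exhibit.db"
    # The schema is optional, so an unset or empty value becomes None.
    schema = env.get("EXHIBIT_DB_SCHEMA") or None
    return url, schema


if __name__ == "__main__":
    print(resolve_db_settings({}))
    print(resolve_db_settings({
        "EXHIBIT_DB_URL": "postgresql://user:password@localhost/lookups",
        "EXHIBIT_DB_SCHEMA": "exhibit",
    }))
```

Passing a plain dictionary instead of reading `os.environ` directly keeps the sketch easy to try without changing your shell environment.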
- You can now save probabilities for columns you marked as linked in the CLI.
- UUID columns can now be generated using incrementing integer values by setting `anonymising_set` to `range`. You can also set different seeds for each UUID column.
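
To make the difference between the two kinds of identifiers concrete, here is a minimal standard-library sketch. It is not Exhibit's implementation: the function names, the starting value of 1 and the way seeds are applied are all assumptions for illustration.

```python
import random
import uuid


def range_ids(n, start=1):
    """Incrementing integer identifiers (the starting value is an assumption)."""
    return list(range(start, start + n))


def seeded_uuids(n, seed):
    """Reproducible random UUIDs: the same seed always yields the same values."""
    rng = random.Random(seed)
    # Build version 4 UUIDs from 128 seeded random bits.
    return [str(uuid.UUID(int=rng.getrandbits(128), version=4)) for _ in range(n)]


if __name__ == "__main__":
    print(range_ids(3))  # [1, 2, 3]
    # Two columns with different seeds receive different identifiers.
    print(seeded_uuids(2, seed=0))
    print(seeded_uuids(2, seed=1))
```

Using a separate seed per column means one identifier column can be regenerated without changing the values in another.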
- Improved the calculation of weights for numerical columns.
- Various other minor bug fixes and improvements to error messages.
- Added `scipy` and `sqlalchemy` as dependencies.
- Added experimental support for using pickled machine learning models as plug-ins. See the `Create Exhibit-compatible ML model.ipynb` recipe for details.
- Added an option to save probabilities of values in columns that are put into the DB.
- Added performance and memory benchmarking.
- You can now reference custom lookups you added to the DB directly in the specification as long as specification columns and DB columns match.
- Added a `make_almost_same` custom action.
- Fixed a bug where generating a dataset without any categorical columns would give an error.
- Fixed a DB bug that gave missing data an equal chance to appear for columns where number of uniques exceeded the in-line limit.
- Added `dill` as a dependency.
- Added 4 new custom actions to manipulate timeseries: given a numerical column and a timeseries column, create artificial skew (left or right) or add peak / valley.
- `generate_as_sequence` custom action has a new variant that lets you generate repeated sequences of values in the order that they appear in the spec, regardless of the probability vector.
- You can now apply a single custom action to multiple columns by providing them as a comma-separated target string. The same applies to actions. The processing of custom constraints happens in the order in which column names / actions were specified.
- Fixed an issue where custom constraints wouldn't always respect original column types (float or Int64).
- Fixed an issue where column values generated from a regular expression pattern were inadvertently repeated under certain conditions.
- Fixed a bug with missing values in user linked columns.
- Fixed a bug that could result in linked column groups being in different order when re-running the generation of the same specification.
- numpy bumped to 1.22.
- Added experimental support for generating geospatial data. You can now generate point geometry with latitude and longitude coordinates sampled from H3 hexagons.
- Additionally, you can create random, but geographically-valid regions to match the partitions in the data. This is done using a new custom action called geo_make_regions.
- You can now add noise to user linked column groups. Rather than mirroring the original relationships exactly, links can be formed between random column values based on a specified probability.
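
The idea of noisy links can be sketched as follows. This is an illustration of the concept only, assuming a single parent-to-child mapping and a per-row noise probability; the function name, arguments and example values are hypothetical and do not reflect how Exhibit stores or applies links.

```python
import random


def link_with_noise(parents, links, children, noise, seed=0):
    """Pick a child value for each parent value.

    With probability `noise` the child is drawn at random from all children;
    otherwise the original parent-to-child link is mirrored exactly.
    """
    rng = random.Random(seed)
    result = []
    for parent in parents:
        if rng.random() < noise:
            result.append(rng.choice(children))  # break the link at random
        else:
            result.append(links[parent])  # keep the original relationship
    return result


if __name__ == "__main__":
    links = {"North": "Hospital A", "South": "Hospital B"}
    children = ["Hospital A", "Hospital B"]
    parents = ["North", "South", "North", "South"]
    print(link_with_noise(parents, links, children, noise=0.0))
    print(link_with_noise(parents, links, children, noise=0.5))
```

With `noise=0.0` the output mirrors the original relationships; higher values let more rows deviate from them.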
- Added a new custom action: generate_as_sequence. This action is useful when generated values must follow a specific order in a partition, like vaccine doses administered to an individual: "full schedule" before "booster". This is different from sorting because sorting happens after the data has been generated, whereas generate_as_sequence will ensure that "booster" is never generated by itself - only when preceded by "full schedule".
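
The contrast with sorting can be shown with a short sketch. It assumes each partition is described only by its row count, and that the sequence wraps around when a partition has more rows than there are values; both are simplifications for illustration, and `values_in_sequence` is a hypothetical name rather than part of Exhibit.

```python
def values_in_sequence(partition_sizes, sequence):
    """Assign values to each partition's rows in the given order.

    Every partition starts from the first value in `sequence`, so a later
    value can never appear unless the earlier ones come before it.
    """
    rows = []
    for key, size in partition_sizes.items():
        for i in range(size):
            # Wrap around if the partition is longer than the sequence (assumption).
            rows.append((key, sequence[i % len(sequence)]))
    return rows


if __name__ == "__main__":
    doses = ["full schedule", "booster"]
    # Person "a" has one row, person "b" has two.
    for row in values_in_sequence({"a": 1, "b": 2}, doses):
        print(row)
```

Drawing values at random and sorting them afterwards could still leave person "a" with a lone "booster" row; generating in sequence rules that out by construction.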
- When generating missing data, there was a chance that missing values would be generated in the same rows for different columns rather than independently.
- Fixed a number of issues around nullable integer type.
- pandas bumped to 1.4.2.
- numpy bumped to 1.21.5.
- PyYAML bumped to 6.0.
- When asking Exhibit to generate a specification from a dataset that didn't contain any numerical columns, the resulting specification was missing probability information for categorical columns below the in-line limit.
- Revised the specification of custom constraints (previously called conditional constraints). Now you can specify the subset (filter) of the data, the partition, and one or more columns to be affected by the constraints. In addition to `make_null`, `make_not_null` and `make_outlier`, there are 4 new constraints available: `make_same`, `make_distinct`, `sort_ascending` and `sort_descending`.
- Added an option to generate uuid columns. If your source dataset includes record-level data with unique identifiers, you can exclude them from processing and generate them separately. The uuid columns work differently to normal categorical columns in that you specify the probabilities of the frequency of each unique value appearing in your synthetic dataset. See the `uuid_demo.yml` and `uuid_anon.csv` files for examples.
- Added an option to designate numerical columns like age or dose number as categorical. List such columns after the `--discrete_columns` flag in the CLI.
- Fixed a RNG-related bug that could result in slightly different datasets being generated on Linux and Windows from the same specification.
- You can now use Exhibit as an importable library, not just as a CLI program. See `recipes/exhibit_scripting.py` for examples of the basic API.
- Exhibit now correctly handles columns composed entirely of `boolean` values. For the purposes of dataset generation they are treated as categorical rather than numerical values.
Hotfix for a Windows-specific bug related to SQLite3 type adaptors.
First beta release ready for limited use in production.