Release notes


0.9.9 (September 9, 2024)

Minor version release updating dependencies.

Notable version upgrades
  • The default Python version used in automated tests is now 3.12
  • Pandas updated to version 2.2.2
  • SQLAlchemy updated to 2.x API
  • Numpy updated to 2.x API

0.9.8 (November 20, 2023)

Enhancements
  • Experimental support for using SQL to generate anonymising sets of values. This feature is available for all column types except numerical.
  • make_distinct custom action now works on date columns.
  • You can now easily add a column with current date and time by using '@sysdate' as a derived column.
  • Pseudo-CHI numbers can now be generated by passing pseudo_chi as the anonymising set for UUID columns.
  • Numerical column weights for categorical values are now optional. This should speed up the process of manually composing a specification.
Bug fixes
  • Minor bugs fixed in shift_distribution, make_outlier and make_distinct.
  • Fixed a bug in regex distribution where the target number of uniques wasn't respected.
  • Fixed date column not being recognized if source data had missing values.
Package version upgrades
  • Python version changed to 3.10
  • Pandas updated to version 2.x

0.9.7 (November 15, 2022)

Enhancements
  • Using Exhibit as an importable library is now easier. Please see the scripting recipe for more details and examples.
  • anon.db is now called exhibit.db. You can also now use third-party databases to store associated specification / aliasing data, as long as you have the required SQLAlchemy dialect installed. Set the EXHIBIT_DB_SCHEMA and EXHIBIT_DB_URL environment variables and Exhibit will use those instead of the local exhibit.db.
  • New custom action & filter pairs: shift_distribution_right / left and COLUMN_NAME with_high / low_frequency
  • You can now save probabilities for columns you marked as linked in the CLI.
  • UUID columns can now be generated using incrementing integer values by setting anonymising_set to range. You can also set different seeds for each UUID column.
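
The environment-variable switch described above can be pictured with a small sketch. Only the names EXHIBIT_DB_URL, EXHIBIT_DB_SCHEMA and exhibit.db come from these notes; the helper `resolve_db_url`, the PostgreSQL URL and the schema name are hypothetical, and the fallback logic is an assumption about the behaviour rather than Exhibit's actual code.

```python
import os

def resolve_db_url(env=os.environ, local_path="exhibit.db"):
    """Hypothetical helper: prefer EXHIBIT_DB_URL, else the local SQLite file."""
    url = env.get("EXHIBIT_DB_URL")
    schema = env.get("EXHIBIT_DB_SCHEMA")
    if url:
        # Third-party database: the matching SQLAlchemy dialect must be installed.
        return url, schema
    # Assumed default: a SQLite URL pointing at the local exhibit.db.
    return f"sqlite:///{local_path}", None

# Illustrative values only; substitute your own connection details.
custom_env = {
    "EXHIBIT_DB_URL": "postgresql://user:password@localhost:5432/exhibit",
    "EXHIBIT_DB_SCHEMA": "exhibit",
}

print(resolve_db_url({}))          # falls back to the local SQLite file
print(resolve_db_url(custom_env))  # uses the third-party database and schema
```

In practice the variables would be set in the shell or via `os.environ` before Exhibit runs; the dictionary here just keeps the example free of side effects.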
Bug fixes
  • Improved the calculation of weights for numerical columns.
  • Various other minor bug fixes and improvements to error messages.
Package version upgrades
  • Added scipy and sqlalchemy as dependencies.

0.9.6 (August 3, 2022)

Enhancements
  • Added experimental support for using pickled machine learning models as plug-ins. See the Create Exhibit-compatible ML model.ipynb recipe for details.
  • Added an option to save probabilities of values in columns that are put into the DB.
  • Added performance and memory benchmarking.
  • You can now reference custom lookups you added to the DB directly in the specification as long as specification columns and DB columns match.
  • Added a make_almost_same custom action.
Bug fixes
  • Fixed a bug where generating a dataset without any categorical columns would give an error.
  • Fixed a DB bug that gave missing data an equal chance to appear for columns where the number of unique values exceeded the in-line limit.
Package version upgrades
  • Added dill as a dependency.

0.9.5 (June 26, 2022)

Enhancements
  • Added 4 new custom actions to manipulate timeseries: given a numerical column and a timeseries column, create artificial skew (left or right) or add peak / valley.
  • generate_as_sequence custom action has a new variant that lets you generate repeated sequences of values in the order that they appear in the spec, regardless of the probability vector.
  • You can now apply a single custom action to multiple columns by providing them as a comma-separated target string. The same applies to actions. The processing of custom constraints happens in the order in which column names / actions were specified.
Bug fixes
  • Fixed an issue where custom constraints wouldn't always respect original column types (float or Int64).
  • Fixed an issue where column values generated from a regular expression pattern were inadvertently repeated under certain conditions.
  • Fixed a bug with missing values in user linked columns.
  • Fixed a bug that could result in linked column groups being in different order when re-running the generation of the same specification.
Package version upgrades
  • numpy bumped to 1.22.

0.9.4 (June 6, 2022)

Enhancements
  • Added experimental support for generating geospatial data. You can now generate point geometry with latitude and longitude coordinates sampled from H3 hexagons.
  • Additionally, you can create random, but geographically-valid regions to match the partitions in the data. This is done using a new custom action called geo_make_regions.
  • You can now add noise to user linked column groups. Rather than mirroring the original relationships exactly, links can be formed between random column values based on a specified probability.
  • Added a new custom action: generate_as_sequence. This action is useful when generated values must follow a specific order in a partition, like vaccine doses administered to an individual: "full schedule" before "booster". This is different from sorting because sorting happens after the data has been generated, whereas generate_as_sequence will ensure that "booster" is never generated by itself - only when preceded by "full schedule".
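
To illustrate the idea behind generate_as_sequence (not its implementation), here is a minimal sketch in plain Python. The function `assign_in_sequence` is hypothetical, and wrapping back to the start of the sequence when a partition has more rows than sequence values is an assumption made for the example.

```python
from collections import defaultdict

def assign_in_sequence(partition_keys, sequence):
    """Hypothetical helper: give each row the next value in `sequence`
    for its partition, so later values never appear before earlier ones."""
    seen = defaultdict(int)  # rows already generated per partition
    values = []
    for key in partition_keys:
        # Assumption: wrap around if a partition outgrows the sequence.
        values.append(sequence[seen[key] % len(sequence)])
        seen[key] += 1
    return values

people = ["A", "B", "A", "C", "B"]
doses = assign_in_sequence(people, ["full schedule", "booster"])
print(doses)
# ['full schedule', 'full schedule', 'booster', 'full schedule', 'booster']
```

Person C has a single row and therefore only ever receives "full schedule", which is the guarantee that sorting after generation cannot provide.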
Bug fixes
  • Fixed an issue where missing values could be generated in the same rows across different columns rather than independently.
  • Fixed a number of issues around the nullable integer type.
Package version upgrades
  • pandas bumped to 1.4.2.
  • numpy bumped to 1.21.5.
  • PyYAML bumped to 6.0.

0.9.3 (April 19, 2022)

Bug fixes
  • When asking Exhibit to generate a specification from a dataset that didn't contain any numerical columns, the resulting specification was missing probability information for categorical columns below the in-line limit.
Enhancements
  • Revised the specification of custom constraints (previously called conditional constraints). You can now specify the subset (filter) of the data, the partition, and one or more columns to be affected by the constraints. In addition to make_null, make_not_null and make_outlier, there are 4 new constraints available: make_same, make_distinct, sort_ascending and sort_descending.
  • Added an option to generate uuid columns. If your source dataset includes record-level data with unique identifiers, you can exclude them from processing and generate them separately. The uuid columns work differently to normal categorical columns in that you specify the probabilities of the frequency of each unique value appearing in your synthetic dataset. See uuid_demo.yml and uuid_anon.csv files for examples.
  • Added an option to designate numerical columns like age or dose number as categorical. List such columns after the --discrete_columns flag in CLI.
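
The frequency-based approach to uuid columns described above can be sketched as follows. This is an illustration of the concept only: the function `make_uuid_column`, its parameters and the example probabilities are hypothetical, and the real specification format is the one shown in uuid_demo.yml.

```python
import random
import uuid

def make_uuid_column(n_rows, freq_probs, seed=0):
    """Hypothetical helper: build a column of identifiers where
    freq_probs maps "times an identifier repeats" to its probability."""
    rng = random.Random(seed)
    freqs = list(freq_probs)
    weights = list(freq_probs.values())
    column = []
    while len(column) < n_rows:
        # Draw how many rows this identifier should occupy.
        repeats = rng.choices(freqs, weights=weights, k=1)[0]
        value = str(uuid.UUID(int=rng.getrandbits(128), version=4))
        column.extend([value] * repeats)
    column = column[:n_rows]  # the last identifier may be cut short
    rng.shuffle(column)
    return column

# Assumed example: half of the identifiers appear once, 30% twice, 20% three times.
ids = make_uuid_column(10, {1: 0.5, 2: 0.3, 3: 0.2}, seed=1)
print(len(ids), len(set(ids)))
```

Seeding the generator keeps the column reproducible between runs, which matters when the same specification is regenerated.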

0.9.2 (February 14, 2022)

Bug fixes
  • Fixed an RNG-related bug that could result in slightly different datasets being generated on Linux and Windows from the same specification.
Enhancements
  • You can now use Exhibit as an importable library, not just as a CLI program. See recipes/exhibit_scripting.py for examples of the basic API.
  • Exhibit now correctly handles columns composed entirely of boolean values. For the purposes of dataset generation they are treated as categorical rather than numerical values.

0.9.1 (December 5, 2021)

Hotfix for a Windows-specific bug related to SQLite3 type adapters.

0.9.0 (December 4, 2021)

First beta release ready for limited use in production.