FastADC implementation #470

Merged
merged 96 commits into from
Jan 30, 2025
Commits
5bab0ad
Replace int with int64_t in Predicate class
ol-imorozko Sep 29, 2024
40ef53b
Initial commit that adds dc folder and placeholder for dc.h
ol-imorozko Feb 25, 2024
4836903
Implement method to get value from TypedColumnData
ol-imorozko Sep 24, 2024
b4b4775
Implement IndexProvider class
ol-imorozko Mar 2, 2024
e370750
Implement functions to get similarities metrics between two columns
ol-imorozko Mar 6, 2024
3f2bc2d
Implement PrediateBuilder class
ol-imorozko Mar 7, 2024
7938d6b
Implement tests for predicate space building
ol-imorozko Mar 8, 2024
99f6308
Implement Pli and PliShardBuilder
ol-imorozko Mar 16, 2024
7564338
Implement CommonClueSetBuilder
ol-imorozko Sep 21, 2024
c5ba8ac
Implement SingleClueSetBuilder
ol-imorozko Sep 21, 2024
0f51459
Implement CrossClueSetBuilder
ol-imorozko Sep 21, 2024
c1d19df
Add test that checks static fields of CommonClueSetBuilder
ol-imorozko May 1, 2024
ea9237b
Implement ClueSetBuilder
ol-imorozko Sep 21, 2024
06322d9
Add test that checks ClueSet building
ol-imorozko Sep 24, 2024
7ec33a4
Add an ability to force kString type on TypedColumnData instead of kM…
ol-imorozko Sep 24, 2024
a64b7be
Add initial EvidenceSetBuilder class that builds cardinality mask
ol-imorozko Sep 29, 2024
bb175fc
Add test that verifies CardinalityMask
ol-imorozko Sep 29, 2024
81db940
Implement Evidence
ol-imorozko Sep 29, 2024
f72b49b
Implement EvidenceSet
ol-imorozko Sep 29, 2024
74f3c59
Implement EvidenceSetBuilder
ol-imorozko Sep 29, 2024
5561bae
Add test to verify evidence set
ol-imorozko Sep 29, 2024
14bbecc
Fix wrong creating of inverted predicate, operands were swapped
ol-imorozko Oct 3, 2024
9b8f4ad
Add type alias for bitset holding predicates
ol-imorozko Oct 1, 2024
310bd56
Implement PredicateOrganizer class
ol-imorozko Oct 1, 2024
6141589
Add test that validates predicate organizer
ol-imorozko Oct 1, 2024
610cdd8
Implement DCCandidateTrie class
ol-imorozko Oct 1, 2024
b2d25c8
Implement PredicateSet class
ol-imorozko Mar 2, 2024
6e167a5
Implement DenialConstraint class
ol-imorozko Oct 2, 2024
9cccbdf
Return reference from GetImplications Predicate method
ol-imorozko Oct 2, 2024
9b8c61b
Implement Closure class
ol-imorozko Oct 2, 2024
b8a4d13
Implement NTreeSearch class
ol-imorozko Oct 2, 2024
d4b445a
Implement DenialConstraintSet
ol-imorozko Oct 2, 2024
4765983
Implement ApproximateEvidenceInverter class
ol-imorozko Oct 2, 2024
f1ef3d0
Implement test for approximate denial constraints
ol-imorozko Oct 2, 2024
a5c1344
Change namespace model to namespace algos::fastadc for FastADC files
ol-imorozko Oct 3, 2024
4c93953
Split FastADC files into subfolders
ol-imorozko Oct 3, 2024
c588d6a
Correct includes paths after renaming and moving FastADC files
ol-imorozko Oct 3, 2024
156926a
Refactor providers* structures
ol-imorozko Oct 3, 2024
6ca4903
Adjust unittests after providers refactoring
ol-imorozko Oct 4, 2024
1f5eb8b
Extract predicate packs and correction map building to a separate class
ol-imorozko Oct 5, 2024
baacf36
Move cardinality mask building from Evidence set to a new structure
ol-imorozko Oct 5, 2024
1ac1976
Remove unused clue field from Evidence class
ol-imorozko Oct 5, 2024
b62f3ad
Remove unused N field from SearchNode class
ol-imorozko Oct 5, 2024
99274ef
Optimize AccumulateClues by hashing clue with zero value
ol-imorozko Oct 5, 2024
424e2cf
Increase performance of AccumulateClues by preallocating and utilizin…
ol-imorozko Oct 5, 2024
a013a51
Optimize clues by moving allocations out of Build* methods
ol-imorozko Oct 5, 2024
23c3be7
Change predicate bitset size from 64 to 128
ol-imorozko Oct 6, 2024
b9e2350
Add missing cvs file for DC mining testing
ol-imorozko Jan 13, 2025
40286e4
Fix clang-format header recommendtaion
ol-imorozko Dec 1, 2024
24202c0
Add DependentFalse to namespace::details
ol-imorozko Dec 1, 2024
ddbb0d7
Remove inline from template functions
ol-imorozko Dec 1, 2024
272ffaf
Remove redundant static variables
ol-imorozko Dec 1, 2024
e246296
Don't define several variables on the same line
ol-imorozko Dec 1, 2024
4f0c40d
Do not use std::initializer_list to store anything, replace with cons…
ol-imorozko Dec 1, 2024
971ccce
NTreeSearch() = default Not needed
ol-imorozko Dec 1, 2024
85884c8
Rename GetInverse/MutexMap to Take since we're moving them
ol-imorozko Dec 1, 2024
f97dd82
Capture only coverages instead of everything in labda
ol-imorozko Dec 1, 2024
a25047f
To mimic Java's behavior, return 0.0 when avg1=avg2=0 in GetAverageRatio
ol-imorozko Dec 1, 2024
8d3e759
Do not reopen namespace std when specializing std::hash
ol-imorozko Dec 1, 2024
2308f24
No need to define hash_value, declaration is enough to use boost::has…
ol-imorozko Dec 1, 2024
721243d
Use BetterEnum instead of bool to indicate tuple in ColumnOperand
ol-imorozko Dec 2, 2024
4c091e6
Apply IWYU to src/core/algorithms/dc/FastADC
ol-imorozko Dec 2, 2024
39fe20b
Don't use relative paths in inlcudes
ol-imorozko Dec 2, 2024
e8f2f83
Apply clang-format after header changes
ol-imorozko Dec 2, 2024
149a8bd
Remove unnecessary assert, predicate_index_provider can't be null the…
ol-imorozko Dec 2, 2024
557c734
Define operator!= of DenialConstraint
ol-imorozko Dec 2, 2024
23bdb9a
Move Initizlize*Map in Operator to private section and add alias for map
ol-imorozko Dec 2, 2024
9c6e536
Use default == and != operators from Operator
ol-imorozko Dec 2, 2024
40c8c8f
Add TODO comments for classes that are both used for DC mining and ve…
ol-imorozko Jan 13, 2025
d77e696
Use emplace back where needed
ol-imorozko Jan 13, 2025
5200963
Make PliShard fields private
ol-imorozko Jan 13, 2025
42875dc
Do not explicitly delete PredicateSet constructor
ol-imorozko Jan 13, 2025
4d48c56
Make GetIndex in IndexProvider accept references
ol-imorozko Jan 13, 2025
232fb73
Add inline to operator& and &= in aei.h to avoid violating ODR
ol-imorozko Jan 13, 2025
8a82902
Don't capture all variables by references in aei.h
ol-imorozko Jan 13, 2025
0e6dc18
Remove shared_ptrs in aei.h and simplify Hit method
ol-imorozko Jan 13, 2025
526c212
Get rid of try-catch in clue builder by adding a helper function to c…
ol-imorozko Jan 13, 2025
dce9fac
Split variable definitions into several lines
ol-imorozko Jan 13, 2025
c12f0ff
Add missing braces around if-else
ol-imorozko Jan 13, 2025
5bfb38e
Move BuildClueSet implementation to cpp file
ol-imorozko Jan 13, 2025
c9947d4
Make CompareBitsets a static method of DenialConstraintSet
ol-imorozko Jan 13, 2025
6e7c09c
Implement FastADC algorithm as a class derived from Algorithm
ol-imorozko Jan 13, 2025
131e71f
Add FastADC python bindings
ol-imorozko Jan 13, 2025
07676b3
Correct clang-format issues
ol-imorozko Jan 14, 2025
8a36ac2
Bring back missing tmp_dc.cvs and rename it appropriately
ol-imorozko Jan 14, 2025
22a809a
Add example of ADC mining
ol-imorozko Jan 14, 2025
e9b8332
Some fixes addressed in pull request
ol-imorozko Jan 26, 2025
7e9f9de
Use frozen::unordered_map in Operator class
ol-imorozko Jan 27, 2025
e2ca82d
Use full names in includes
ol-imorozko Jan 27, 2025
57ab2f8
Split TransitivityStep method with helper methods in Closure
ol-imorozko Jan 27, 2025
f0924c1
Refactor NTreeSearch
ol-imorozko Jan 27, 2025
f6b8050
Use CompareResult in DenialConstrainSet
ol-imorozko Jan 27, 2025
436d9b5
Use copy-and-swap idiom in PliShard
ol-imorozko Jan 27, 2025
4af6fe9
Apply clang-format
ol-imorozko Jan 27, 2025
7203420
Add Denial Constraints mining to README and README_PYPI
ol-imorozko Jan 28, 2025
6610ec2
Fix issues addressed in pr
ol-imorozko Jan 30, 2025
4 changes: 3 additions & 1 deletion README.md
@@ -44,7 +44,9 @@ The currently supported data patterns are:
* Association rules (discovery)
* Numerical association rules (discovery)
* Matching dependencies (discovery)
* Variable heterogeneous denial constraints (validation)
* Denial constraints
  - Exact denial constraints (discovery and validation)
  - Approximate denial constraints, with $g_1$ metric (discovery)

The discovered patterns can have many uses:
* For scientific data, especially those obtained experimentally, an interesting pattern allows to formulate a hypothesis that could lead to a scientific discovery. In some cases it even allows to draw conclusions immediately, if there is enough data. At the very least, the found pattern can provide a direction for further study.
4 changes: 3 additions & 1 deletion README_PYPI.md
@@ -62,7 +62,9 @@ The currently supported data patterns are:
* Association rules (discovery)
* Numerical association rules (discovery)
* Matching dependencies (discovery)
* Variable heterogeneous denial constraints (validation)
* Denial constraints
  - Exact denial constraints (discovery and validation)
  - Approximate denial constraints, with $g_1$ metric (discovery)

This package uses the library of the Desbordante platform, which is written in C++. This means that depending on the algorithm and dataset, the runtimes may be cut by 2-10 times compared to the alternatives.

1 change: 1 addition & 0 deletions examples/basic/README.md
@@ -21,6 +21,7 @@ These scenarios showcase a single pattern by discussing its definition and provi
+ [mining_set_od_1.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_set_od_1.py) — a scenario showing how to discover order dependencies based on set axiomatization, part 1.
+ [mining_set_od_2.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_set_od_2.py) — a scenario showing how to discover order dependencies based on set axiomatization, part 2.
+ [mining_ucc.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_ucc.py) — a scenario showing how to discover exact unique column combinations.
+ [mining_adc.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/mining_adc.py) — a scenario showing how to discover approximate denial constraints.
+ [verifying_aucc.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/verifying_aucc.py) — a scenario showing how to verify an approximate unique column combination.
+ [verifying_fd_afd.py](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/verifying_fd_afd.py) — a scenario showing how to verify exact and approximate functional dependencies.
+ [verifying_gfd](https://github.com/Desbordante/desbordante-core/tree/main/examples/basic/verifying_gfd) — a scenario showing how to verify a graph functional dependency.
158 changes: 158 additions & 0 deletions examples/basic/mining_adc.py
@@ -0,0 +1,158 @@
import desbordante as db
import pandas as pd

RED = '\033[31m'
YELLOW = '\033[33m'
GREEN = '\033[32m'
CYAN = '\033[1m\033[36m'
ENDC = '\033[0m'

TABLE_1 = "examples/datasets/taxes.csv"
TABLE_2 = "examples/datasets/taxes_2.csv"

def print_table(filename: str, title: str = "") -> None:
    if title:
        print(f"{title}")
    data = pd.read_csv(filename, header=0)
    print(data, end="\n\n")

def main():
    print(f"""{YELLOW}Understanding Denial Constraints (DCs){ENDC}
In this walkthrough, we follow the definitions described in the paper
\"Fast approximate denial constraint discovery\" by Xiao, Tan,
Wang, and Ma (2022) [Proc. VLDB Endow. 16(2), 269–281].

A Denial Constraint is a statement that says: "For all pairs of different rows in a table,
it should never happen that some condition holds."
Formally, a DC {CYAN}φ{ENDC} forbids a conjunction of predicates and has the following form:
{CYAN}∀s, t ∈ R, s ≠ t: ¬(p_1 ∧ . . . ∧ p_m){ENDC}

For example, look at this small table:
Name     Grade   Salary
Alice    3       3000
Bob      4       4000
Carol    4       4000

A possible DC here is: {CYAN}¬{{ t.Grade == s.Grade ∧ t.Salary != s.Salary }}{ENDC}

This means: "It should never happen that two people have the same grade but different salaries."
In other words, if two rows share the same Grade, they must share the same Salary.

Sometimes, we allow a DC to hold approximately, which means a small number of row pairs
might violate it. The measure used for that is the 'g1' metric. Roughly, the 'g1' metric
checks what fraction of all row pairs violates the DC, and if that fraction does not exceed
a chosen threshold, we consider the DC 'valid enough.'
""")

    print(f"""{YELLOW}Mining Denial Constraints{ENDC}
This example uses two parameters of Desbordante's DC mining algorithm:
1) evidence_threshold: This sets the maximum fraction of row pairs that may violate a DC
for it to still be considered valid. A value of 0 means exact DC mining (no violations allowed).
2) shard_length: This splits the dataset into row "shards" for parallelization.
A value of 0 means no split, so the entire dataset is processed at once.

{YELLOW}Let's begin by looking at TABLE_1:{ENDC}""")

    print_table(TABLE_1, "TABLE_1 (examples/datasets/taxes.csv):")

    print(f"""{YELLOW}Mining exact DCs (evidence_threshold=0) on TABLE_1{ENDC}""")

    # Exact DC mining on TABLE_1
    algo = db.dc.algorithms.Default()
    algo.load_data(table=(TABLE_1, ',', True))
    algo.execute(evidence_threshold=0, shard_length=0)
    dcs_table1_exact = algo.get_dcs()

    print(f"{YELLOW}Discovered DCs:{ENDC}")
    for dc in dcs_table1_exact:
        print(f" {CYAN}{dc}{ENDC}")
    print()

    print(f"""Note the following Denial Constraint we found:
{CYAN}¬{{ t.State == s.State ∧ t.Salary <= s.Salary ∧ t.FedTaxRate >= s.FedTaxRate }}{ENDC}.
It states that for all people in the same state, the person with a higher salary
should have a higher tax rate. No pairs of rows should violate that rule.

Now let's mine approximate DCs by setting evidence_threshold to 0.5.
This means we only require that at least half of all row pairs satisfy each DC (according to 'g1').
""")

    print(f"""{YELLOW}Mining ADCs (evidence_threshold=0.5) on TABLE_1{ENDC}""")

    # Approximate DC mining on TABLE_1
    algo = db.dc.algorithms.Default()
    algo.load_data(table=(TABLE_1, ',', True))
    algo.execute(evidence_threshold=0.5, shard_length=0)
    dcs_table1_approx = algo.get_dcs()

    print(f"{YELLOW}Discovered ADCs:{ENDC}")
    for dc in dcs_table1_approx:
        print(f" {CYAN}{dc}{ENDC}")
    print()

    print(f"""Here, for example, the 'g1' metric values for a few approximate DCs are:
{CYAN}¬{{ t.Salary <= s.Salary ∧ t.FedTaxRate <= s.FedTaxRate }}{ENDC} → 0.486111
{CYAN}¬{{ t.Salary <= s.Salary ∧ t.FedTaxRate >= s.FedTaxRate }}{ENDC} → 0.458333
{CYAN}¬{{ t.State == s.State }}{ENDC} → 0.25
Note: A smaller 'g1' value means fewer violations, making the DC more exact.
""")

    print(f"""{YELLOW}Conclusion:{ENDC}
We found both exact and approximate DCs.

- Exact DCs are those with zero violations, so they must hold for every pair of rows.
- Approximate DCs allow some fraction of violating pairs.

Therefore, an approximate DC can logically imply the exact one.
For example, consider:
Exact DC: {CYAN}¬{{ t.State == s.State ∧ t.Salary == s.Salary }}{ENDC}
Approximate DC: {CYAN}¬{{ t.Salary == s.Salary }}{ENDC}

If the approximate DC (which prohibits any two rows from having the same Salary)
held with no violations, then no two rows could share both the same State and
the same Salary. In that sense, the more general approximate DC implies the exact DC.

In real scenarios, exact DCs may be too rigid.
Allowing a small fraction of violations is often a practical compromise,
but setting a very high threshold quickly becomes meaningless
since it would permit too many inconsistencies.
The best threshold often depends on how 'dirty' the data is; datasets with
more inconsistencies may require a higher threshold to capture meaningful DCs.
""")

    print(f"""{YELLOW}Now let's move on to TABLE_2{ENDC}""")

    print_table(TABLE_2, "TABLE_2 (examples/datasets/taxes_2.csv):")

    print(f"""We added this record for Texas:
{GREEN}(State=Texas, Salary=5000, FedTaxRate=0.05){ENDC}
Notice how it introduces a scenario that breaks the DC we discussed earlier, stating
"the person with a higher salary should have a higher tax rate,"
because there are now people in Texas with a lower salary but a higher tax rate.

Let's see how the exact DC mining changes due to this additional record.
""")

    print(f"""{YELLOW}Mining exact DCs (evidence_threshold=0) on TABLE_2{ENDC}""")

    # Exact DC mining on TABLE_2
    algo = db.dc.algorithms.Default()
    algo.load_data(table=(TABLE_2, ',', True))
    algo.execute(evidence_threshold=0, shard_length=0)
    dcs_table2_exact = algo.get_dcs()

    print(f"{YELLOW}Discovered DCs:{ENDC}")
    for dc in dcs_table2_exact:
        print(f" {CYAN}{dc}{ENDC}")
    print()

    print(f"""We can see that the DC {CYAN}¬{{ t.State == s.State ∧ t.Salary <= s.Salary ∧ t.FedTaxRate >= s.FedTaxRate }}{ENDC}
no longer appears because of the violation introduced by record index 9
({GREEN}(Texas, 5000, 0.05){ENDC}).

Those violations occur in pairs like {RED}(6, 9), (7, 9), (8, 9){ENDC},
where each number is a record index in the dataset.""")

if __name__ == "__main__":
    main()
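
The walkthrough above describes the 'g1' metric only in words. A minimal sketch of that definition follows, using the small Grade/Salary table from the example text. It assumes 'g1' is the fraction of ordered pairs of distinct rows that satisfy every predicate of the DC (that is, violate it), which matches the denominators implied by the values printed in the example. The helper `g1` and the predicate lambdas are hypothetical names for illustration, not part of the Desbordante API, and the brute-force pair enumeration is not how FastADC computes the measure.

```python
from itertools import permutations

# Rows of the small illustrative table from the walkthrough.
rows = [
    {"Name": "Alice", "Grade": 3, "Salary": 3000},
    {"Name": "Bob", "Grade": 4, "Salary": 4000},
    {"Name": "Carol", "Grade": 4, "Salary": 4000},
]


def g1(rows, predicates):
    """Fraction of ordered pairs of distinct rows (t, s) that violate the DC.

    A pair violates the DC not(p_1 and ... and p_m) when every predicate holds.
    """
    pairs = list(permutations(rows, 2))
    if not pairs:
        raise ValueError("at least two rows are needed to form a pair")
    violations = sum(1 for t, s in pairs if all(p(t, s) for p in predicates))
    return violations / len(pairs)


# Hypothetical predicate helpers; each takes the two rows of a pair.
same_grade = lambda t, s: t["Grade"] == s["Grade"]
different_salary = lambda t, s: t["Salary"] != s["Salary"]

# not(t.Grade == s.Grade and t.Salary != s.Salary): no pair violates it.
exact = g1(rows, [same_grade, different_salary])
# not(t.Grade == s.Grade): violated by (Bob, Carol) and (Carol, Bob).
approx = g1(rows, [same_grade])

print(exact, approx)
```

Under this reading, a DC is accepted when its 'g1' value does not exceed `evidence_threshold`, so the first DC is exact while the second would only be reported with a threshold of at least one third.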

146 changes: 146 additions & 0 deletions src/core/algorithms/dc/FastADC/fastadc.cpp
@@ -0,0 +1,146 @@
#include "algorithms/dc/FastADC/fastadc.h"

#include <chrono>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

#include <easylogging++.h>

#include "config/names_and_descriptions.h"
#include "config/option.h"
#include "config/option_using.h"
#include "config/tabular_data/input_table/option.h"
#include "dc/FastADC/model/pli_shard.h"
#include "dc/FastADC/util/approximate_evidence_inverter.h"
#include "dc/FastADC/util/evidence_aux_structures_builder.h"
#include "dc/FastADC/util/evidence_set_builder.h"
#include "dc/FastADC/util/predicate_builder.h"
#include "model/table/column_layout_typed_relation_data.h"

namespace algos::dc {

FastADC::FastADC() : Algorithm({}) {
    RegisterOptions();
    MakeOptionsAvailable({config::kTableOpt.GetName()});
}

void FastADC::RegisterOptions() {
    DESBORDANTE_OPTION_USING;

    config::InputTable default_table;

    RegisterOption(config::kTableOpt(&input_table_));
    RegisterOption(Option{&shard_length_, kShardLength, kDShardLength, 350U});
    RegisterOption(Option{&allow_cross_columns_, kAllowCrossColumns, kDAllowCrossColumns, true});
    RegisterOption(Option{&minimum_shared_value_, kMinimumSharedValue, kDMinimumSharedValue, 0.3});
    RegisterOption(
            Option{&comparable_threshold_, kComparableThreshold, kDComparableThreshold, 0.1});
    RegisterOption(Option{&evidence_threshold_, kEvidenceThreshold, kDEvidenceThreshold, 0.01});
}

void FastADC::MakeExecuteOptsAvailable() {
    using namespace config::names;

    MakeOptionsAvailable({kShardLength, kAllowCrossColumns, kMinimumSharedValue,
                          kComparableThreshold, kEvidenceThreshold});
}

void FastADC::LoadDataInternal() {
    // kMixed type will be treated as a string type
    typed_relation_ = model::ColumnLayoutTypedRelationData::CreateFrom(*input_table_, true, true);

    if (typed_relation_->GetColumnData().empty()) {
        throw std::runtime_error("Got an empty dataset: DC mining is meaningless.");
    }
}

void FastADC::SetLimits() {
    unsigned all_rows_num = typed_relation_->GetNumRows();

    if (shard_length_ > all_rows_num) {
        throw std::invalid_argument(
                "'shard_length' (" + std::to_string(shard_length_) +
                ") must be less than or equal to the number of rows in the table (total "
                "rows: " +
                std::to_string(all_rows_num) + ")");
    }
    if (shard_length_ == 0) shard_length_ = all_rows_num;
}

void FastADC::CheckTypes() {
    model::ColumnIndex columns_num = typed_relation_->GetNumColumns();
    unsigned rows_num = typed_relation_->GetNumRows();

    for (model::ColumnIndex column_index = 0; column_index < columns_num; column_index++) {
        model::TypedColumnData const& column = typed_relation_->GetColumnData(column_index);
        model::TypeId type_id = column.GetTypeId();

        if (type_id == +model::TypeId::kMixed) {
            LOG(WARNING) << "Column with index \"" + std::to_string(column_index) +
                                    "\" contains values of different types. Those values will be "
                                    "treated as strings.";
        } else if (!column.IsNumeric() && type_id != +model::TypeId::kString) {
            throw std::invalid_argument(
                    "Column with index \"" + std::to_string(column_index) +
                    "\" is of unsupported type. Only numeric and string types are supported.");
        }

        for (std::size_t row_index = 0; row_index < rows_num; ++row_index) {
            if (column.IsNullOrEmpty(row_index)) {
                throw std::runtime_error("Some of the value coordinates are null or empty.");
            }
        }
    }
}

void FastADC::PrintResults() {
    LOG(DEBUG) << "Total denial constraints: " << dcs_.TotalDCSize();
    LOG(DEBUG) << "Minimal denial constraints: " << dcs_.MinDCSize();
    LOG(DEBUG) << dcs_.ToString();
}

unsigned long long FastADC::ExecuteInternal() {
    auto const start_time = std::chrono::system_clock::now();
    LOG(DEBUG) << "Start";

    SetLimits();
    CheckTypes();

    PredicateBuilder predicate_builder(&pred_provider_, &pred_index_provider_, allow_cross_columns_,
                                       minimum_shared_value_, comparable_threshold_);
    predicate_builder.BuildPredicateSpace(typed_relation_->GetColumnData());

    PliShardBuilder pli_shard_builder(&int_prov_, &double_prov_, &string_prov_, shard_length_);
    pli_shard_builder.BuildPliShards(typed_relation_->GetColumnData());

    EvidenceAuxStructuresBuilder evidence_aux_structures_builder(predicate_builder);
    evidence_aux_structures_builder.BuildAll();

    EvidenceSetBuilder evidence_set_builder(pli_shard_builder.pli_shards,
                                            evidence_aux_structures_builder.GetPredicatePacks());
    evidence_set_builder.BuildEvidenceSet(evidence_aux_structures_builder.GetCorrectionMap(),
                                          evidence_aux_structures_builder.GetCardinalityMask());

    LOG(DEBUG) << "Built evidence set";
    auto elapsed_milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now() - start_time);
    LOG(DEBUG) << "Current time: " << elapsed_milliseconds.count();

    ApproxEvidenceInverter dcbuilder(predicate_builder, evidence_threshold_,
                                     std::move(evidence_set_builder.evidence_set));

    dcs_ = dcbuilder.BuildDenialConstraints();

    PrintResults();

    elapsed_milliseconds = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::system_clock::now() - start_time);
    LOG(DEBUG) << "Algorithm time: " << elapsed_milliseconds.count();
    return elapsed_milliseconds.count();
}

// TODO: mb make this a list?
std::vector<DenialConstraint> const& FastADC::GetDCs() const {
    return dcs_.GetResult();
}

} // namespace algos::dc
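
`ExecuteInternal` builds an evidence set and then inverts it into denial constraints. As a rough illustration of what an evidence set holds, the sketch below enumerates every ordered pair of distinct rows, records the set of predicates the pair satisfies, and counts how often each such set occurs. This is a deliberately naive quadratic version under stated assumptions: the real builders in this pull request work on PLI shards, clues, and bitsets, none of which are reproduced here, and every name in the sketch (`build_evidence_set`, `violation_fraction`, `COLUMNS`, `OPERATORS`) is hypothetical.

```python
import operator
from collections import Counter
from itertools import permutations

# Toy data: (Grade, Salary) for three rows; an assumption for illustration.
COLUMNS = ("Grade", "Salary")
rows = [(3, 3000), (4, 4000), (4, 4000)]

# Same-column comparison predicates over a pair of rows (t, s).
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def build_evidence_set(rows):
    """Count, for each distinct set of satisfied predicates, the pairs producing it."""
    evidence = Counter()
    for t, s in permutations(rows, 2):
        satisfied = frozenset(
            (column, name)
            for index, column in enumerate(COLUMNS)
            for name, compare in OPERATORS.items()
            if compare(t[index], s[index])
        )
        evidence[satisfied] += 1
    return evidence


def violation_fraction(evidence, dc_predicates):
    """Share of pairs whose evidence contains every predicate of the DC."""
    total = sum(evidence.values())
    violating = sum(n for ev, n in evidence.items() if dc_predicates <= ev)
    return violating / total


evidence = build_evidence_set(rows)
dc = frozenset({("Grade", "=="), ("Salary", "!=")})
print(len(evidence), violation_fraction(evidence, dc))
```

Grouping pairs by their evidence is what makes the later step cheap in principle: a candidate DC is checked against the distinct evidences and their counts rather than against every row pair again.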
61 changes: 61 additions & 0 deletions src/core/algorithms/dc/FastADC/fastadc.h
@@ -0,0 +1,61 @@
#pragma once

#include <memory>
#include <vector>

#include "algorithms/algorithm.h"
#include "dc/FastADC/providers/predicate_provider.h"
#include "dc/FastADC/util/denial_constraint_set.h"
#include "model/denial_constraint.h"
#include "table/column_layout_typed_relation_data.h"
#include "tabular_data/input_table_type.h"

namespace algos::dc {

using namespace fastadc;

class FastADC : public Algorithm {
private:
    unsigned shard_length_;
    bool allow_cross_columns_;
    double minimum_shared_value_;
    double comparable_threshold_;
    double evidence_threshold_;

    config::InputTable input_table_;
    std::unique_ptr<model::ColumnLayoutTypedRelationData> typed_relation_;

    PredicateIndexProvider pred_index_provider_;
    PredicateProvider pred_provider_;
    IntIndexProvider int_prov_;
    DoubleIndexProvider double_prov_;
    StringIndexProvider string_prov_;
    DenialConstraintSet dcs_;

    void MakeExecuteOptsAvailable() override;
    void LoadDataInternal() override;

    void SetLimits();
    void CheckTypes();
    void PrintResults();

    void ResetState() final {
        pred_index_provider_.Clear();
        pred_provider_.Clear();
        int_prov_.Clear();
        double_prov_.Clear();
        string_prov_.Clear();
        dcs_.Clear();
    }

    unsigned long long ExecuteInternal() final;

    void RegisterOptions();

public:
    FastADC();

    std::vector<DenialConstraint> const& GetDCs() const;
};

} // namespace algos::dc
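
The `shard_length_` member declared above is normalized by `SetLimits()` in fastadc.cpp: a value larger than the row count is rejected, and 0 is replaced by the row count. The sketch below mirrors that rule in Python. The second helper, `shard_ranges`, is only an assumption about how contiguous row ranges could follow from the effective length; `PliShardBuilder` is not shown on this page, so its actual splitting may differ. Both function names are hypothetical.

```python
def effective_shard_length(shard_length: int, num_rows: int) -> int:
    """Mirror of the SetLimits() rule: reject oversized values, map 0 to all rows."""
    if num_rows <= 0:
        raise ValueError("the table must contain at least one row")
    if shard_length > num_rows:
        raise ValueError(
            f"'shard_length' ({shard_length}) must be less than or equal to "
            f"the number of rows in the table (total rows: {num_rows})"
        )
    return num_rows if shard_length == 0 else shard_length


def shard_ranges(shard_length: int, num_rows: int) -> list[tuple[int, int]]:
    """Assumed splitting into half-open row ranges [start, end); illustrative only."""
    length = effective_shard_length(shard_length, num_rows)
    return [
        (start, min(start + length, num_rows))
        for start in range(0, num_rows, length)
    ]


print(shard_ranges(4, 10))
print(shard_ranges(0, 10))
```

With this rule, passing `shard_length=0` (as the example script does) always yields a single shard covering the whole table.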