FastADC implementation #470

ol-imorozko · 2024-10-05T17:41:11Z

FastADC was introduced in this paper to efficiently mine Denial Constraints (DCs). A DC states that for any pair of rows, it should never be the case that some condition (e.g., t.A = s.A and t.B ≠ s.B) is satisfied.

This PR implements that algorithm.
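To make the definition concrete, here is a minimal brute-force sketch (not the FastADC algorithm itself) that counts the row pairs violating the example DC `not (t.A = s.A and t.B ≠ s.B)`. The `Row` type, the function names and the pair-fraction threshold used for the approximate variant are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical row type for illustration; not part of the PR.
struct Row {
    std::string a;
    int b;
};

// Counts ordered pairs (t, s), t != s, that satisfy t.A = s.A and t.B != s.B,
// i.e. the pairs that violate the DC  not(t.A = s.A and t.B != s.B).
std::size_t CountViolations(std::vector<Row> const& rows) {
    std::size_t violations = 0;
    for (std::size_t t = 0; t < rows.size(); ++t) {
        for (std::size_t s = 0; s < rows.size(); ++s) {
            if (t == s) continue;
            if (rows[t].a == rows[s].a && rows[t].b != rows[s].b) ++violations;
        }
    }
    return violations;
}

// Assumption: an approximate DC tolerates a bounded fraction of violating pairs.
bool HoldsApproximately(std::vector<Row> const& rows, double threshold) {
    assert(threshold >= 0.0 && threshold <= 1.0);
    std::size_t const n = rows.size();
    if (n < 2) return true;
    double const total_pairs = static_cast<double>(n) * static_cast<double>(n - 1);
    return static_cast<double>(CountViolations(rows)) / total_pairs <= threshold;
}
```

The quadratic scan is only meant to illustrate the semantics; avoiding it is the point of the evidence-set machinery described in the commits.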

github-actions

clang-tidy made some suggestions

src/tests/test_dc_structures_correct_results.h

src/core/algorithms/dc/FastADC/misc/misc.h

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

src/core/algorithms/dc/FastADC/util/evidence_aux_structures_builder.cpp

src/core/algorithms/dc/FastADC/util/evidence_aux_structures_builder.h

src/core/algorithms/dc/FastADC/util/ntree_search.h

src/core/algorithms/dc/FastADC/util/predicate_builder.h

src/core/algorithms/dc/FastADC/util/predicate_organizer.h

TypedColumnData kInt type is int64_t, and FastADC algorithm uses 64-bit long types

FastADC algorithm for mining approximate Denial Constraints will be implemented here.

IndexProvider assigns a unique index to each distinct object of type T added to it. It will be used later for two main operations:
1. Mapping all predicates to numbers, so that dynamic bitsets can be used for quick intersection and similar operations.
2. Hashing all values in the table while keeping their relative order (columns of DC-unsupported types are ignored; only ints, doubles and strings are allowed). That is, equal values are substituted by the same integer, and higher values are replaced by larger integers.
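The two uses described above can be sketched roughly as follows. `SimpleIndexProvider` and `OrderPreservingCodes` are hypothetical names for this illustration; the actual IndexProvider in the PR may have a different interface.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Simplified stand-in for the IndexProvider described above.
template <typename T>
class SimpleIndexProvider {
public:
    // Returns the index of `object`, assigning the next free index on first sight.
    std::size_t GetIndex(T const& object) {
        auto [it, inserted] = indexes_.try_emplace(object, objects_.size());
        if (inserted) objects_.push_back(object);
        return it->second;
    }

    T const& GetObject(std::size_t index) const {
        assert(index < objects_.size());
        return objects_[index];
    }

    std::size_t Size() const { return objects_.size(); }

private:
    std::unordered_map<T, std::size_t> indexes_;
    std::vector<T> objects_;
};

// Replaces values by dense integer codes that preserve order:
// equal values get equal codes, larger values get larger codes.
template <typename T>
std::vector<std::size_t> OrderPreservingCodes(std::vector<T> const& column) {
    std::vector<T> sorted(column);
    std::sort(sorted.begin(), sorted.end());
    sorted.erase(std::unique(sorted.begin(), sorted.end()), sorted.end());

    std::vector<std::size_t> codes;
    codes.reserve(column.size());
    for (T const& value : column) {
        auto it = std::lower_bound(sorted.begin(), sorted.end(), value);
        codes.push_back(static_cast<std::size_t>(it - sorted.begin()));
    }
    return codes;
}
```

For example, the column `{30, 10, 30, 20}` would be encoded as `{2, 0, 2, 1}`, so comparisons on the codes give the same results as comparisons on the original values.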

The FastADC algorithm decides which column pairs to build predicates from by checking whether the columns are comparable with `==, !=, <, >, >=, <=` or only with `==, !=`. In both cases, when the columns are of the expected type (string, int or double) but different, we need to assert some kind of similarity between them. Otherwise the predicate space will be too big and not really interesting from the DC-finding standpoint, since it will contain predicates like `!=` between two completely different data attributes. The metrics are:
- "shared percentage": measures the overlap between two columns by considering the frequency of each unique element. It calculates the frequency of each unique value in both columns and determines the ratio of the shared values to the total values.
- "average ratio": computes the average value of each column and then returns the ratio of the smaller average to the larger average.
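A rough sketch of the two metrics is given below. The exact formula for the shared percentage is not spelled out above, so this version assumes one plausible reading: the sum of the smaller per-value frequencies divided by the sum of the larger ones. The function names and the use of plain vectors are also assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Assumed reading of "shared percentage": for every distinct value, the smaller of
// its two frequencies counts as shared and the larger one counts toward the total.
double SharedPercentage(std::vector<int> const& c1, std::vector<int> const& c2) {
    std::map<int, std::size_t> freq1;
    std::map<int, std::size_t> freq2;
    for (int v : c1) ++freq1[v];
    for (int v : c2) ++freq2[v];

    std::size_t shared = 0;
    std::size_t total = 0;
    for (auto const& [value, count] : freq1) {
        auto it = freq2.find(value);
        std::size_t const other = it == freq2.end() ? 0 : it->second;
        shared += std::min(count, other);
        total += std::max(count, other);
    }
    // Values that occur only in the second column contribute to the total only.
    for (auto const& [value, count] : freq2) {
        if (freq1.find(value) == freq1.end()) total += count;
    }
    return total == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(total);
}

// Ratio of the smaller column average to the larger one.
double AverageRatio(std::vector<double> const& c1, std::vector<double> const& c2) {
    assert(!c1.empty() && !c2.empty());
    double const avg1 = std::accumulate(c1.begin(), c1.end(), 0.0) / static_cast<double>(c1.size());
    double const avg2 = std::accumulate(c2.begin(), c2.end(), 0.0) / static_cast<double>(c2.size());
    double const smaller = std::min(avg1, avg2);
    double const larger = std::max(avg1, avg2);
    return larger == 0.0 ? 0.0 : smaller / larger;
}
```

A column pair would then be kept for predicate generation only if the relevant metric exceeds some threshold; the threshold values are not stated here and are left out of the sketch.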

Generates and categorizes predicates for the future evidence set construction

This test builds the predicate space from the provided data (CSV file) and compares the list of predicates that will be used later for DC discovery with the expected one. The expected list of predicates was built manually by running the FastADC Java implementation. The next test checks that the inverse and mutex maps are built correctly.

This commit introduces the building of Position List Indexes (Pli). It works with hashed column data, such that equal values are represented by identical keys, and values are sorted by their natural order. We also build so-called PliShards, which are just Plis for a specific segment of the dataset, splitting the whole dataset into a number of shards. This will allow us to be more efficient later.
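The idea can be sketched as follows, assuming the column has already been hashed to order-preserving integers. `SimplePli`, `BuildPli` and `BuildShards` are simplified stand-ins written for this illustration (keys are kept in ascending order here); they are not the PR's actual classes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

struct SimplePli {
    std::vector<std::int64_t> keys;                  // distinct values, ascending
    std::vector<std::vector<std::size_t>> clusters;  // clusters[i]: row ids holding keys[i]
    std::unordered_map<std::int64_t, std::size_t> key_to_cluster_id;
};

// Builds a PLI over the rows [begin, end) of one hashed column.
SimplePli BuildPli(std::vector<std::int64_t> const& column, std::size_t begin, std::size_t end) {
    assert(begin <= end && end <= column.size());
    std::map<std::int64_t, std::vector<std::size_t>> grouped;  // ordered by key
    for (std::size_t row = begin; row < end; ++row) grouped[column[row]].push_back(row);

    SimplePli pli;
    for (auto& [key, rows] : grouped) {
        pli.key_to_cluster_id.emplace(key, pli.keys.size());
        pli.keys.push_back(key);
        pli.clusters.push_back(std::move(rows));
    }
    return pli;
}

// Splits the column into consecutive shards of at most `shard_length` rows.
std::vector<SimplePli> BuildShards(std::vector<std::int64_t> const& column,
                                   std::size_t shard_length) {
    assert(shard_length > 0);
    std::vector<SimplePli> shards;
    for (std::size_t begin = 0; begin < column.size(); begin += shard_length) {
        std::size_t const end = std::min(begin + shard_length, column.size());
        shards.push_back(BuildPli(column, begin, end));
    }
    return shards;
}
```

Because each shard only indexes its own row range, clue construction can later work on one shard or on a pair of shards at a time instead of on the whole table.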

This class organizes predicates into packs and creates a correction map, which will be used for optimizing predicate comparisons in derived classes that will actually build clues from PLIs

Inherits from CommonClueSetBuilder and builds clues from one PLI shard

Inherits from CommonClueSetBuilder and builds clues from two PLI shards

Validates the number of bits in the clue, the structure of the predicate packs, and the correction map, which stores predicate-to-bitset mappings

This is a class for constructing clues from PliShards.

The expected values are, once again, taken from the Java implementation

github-actions

clang-tidy made some suggestions

src/core/algorithms/dc/FastADC/model/pli_shard.h

…ixed

For now this class builds necessary structures to build Evidences later. The structures are clue set, correction map and cardinality mask.

This is a class that maps one-to-one with Clue. The ApproximateEvidenceInversion algorithm (AEI), which will build approximate denial constraints, uses Evidences as its input

EvidenceSet is basically just a vector of evidences. The only thing it adds is a method to get the total count (I probably can publicly inherit from std::vector<Evidence>...?)

Add the building of evidences

…heck first

p-senichenkov

All my comments on this PR will address specific issues related to variations of Clang (see #507). You can also rebase to my branch (PR#507) and see CI failures.

I'm going to add a wiki page with a detailed description of the Clang issues, but it'll take some time, so for now, if you have questions, you'd better reach me on Telegram or here.

For now I'm posting only the first part of the comments, those related to the build phase. I'll post more comments if there are problems with the tests.

src/core/algorithms/dc/FastADC/model/predicate.h

p-senichenkov · 2025-01-14T13:16:18Z

src/core/algorithms/dc/FastADC/util/predicate_organizer.h

@@ -0,0 +1,98 @@
+#pragma once
+
+#include "dc/FastADC/model/evidence_set.h"


Suggested change

-#include "dc/FastADC/model/evidence_set.h"
+#include "dc/FastADC/model/evidence_set.h"
+#include "util/bitset_extensions.h"

p-senichenkov · 2025-01-14T13:16:43Z

src/core/algorithms/dc/FastADC/util/predicate_organizer.h

+
+        for (auto const& evidence : evidence_set_) {
+            PredicateBitset bitset = evidence.evidence;
+            for (size_t i = bitset._Find_first(); i != bitset.size(); i = bitset._Find_next(i)) {


Suggested change

-            for (size_t i = bitset._Find_first(); i != bitset.size(); i = bitset._Find_next(i)) {
+            util::BitsetIterator<kPredicateBits> iter{bitset};
+            for (size_t i = iter.Pos(); i != bitset.size(); iter.Next(), i = iter.Pos()) {
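For context, `_Find_first` and `_Find_next` are libstdc++ extensions of `std::bitset` and are not available in libc++, which is why a wrapper is suggested. The sketch below shows one portable way to visit the set bits using only standard members; `FindSetBitFrom` and `SetBitPositions` are hypothetical helpers for illustration and are not the project's `util::BitsetIterator`.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the index of the first set bit at or after `from`, or N if there is none.
template <std::size_t N>
std::size_t FindSetBitFrom(std::bitset<N> const& bits, std::size_t from) {
    assert(from <= N);
    for (std::size_t i = from; i < N; ++i) {
        if (bits.test(i)) return i;
    }
    return N;
}

// Collects the positions of all set bits in ascending order.
template <std::size_t N>
std::vector<std::size_t> SetBitPositions(std::bitset<N> const& bits) {
    std::vector<std::size_t> positions;
    for (std::size_t i = FindSetBitFrom(bits, 0); i != N; i = FindSetBitFrom(bits, i + 1)) {
        positions.push_back(i);
    }
    return positions;
}
```

This fallback tests one bit at a time, so it is likely slower than a word-level scan; it only illustrates the iteration pattern that the wrapper hides.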

p-senichenkov · 2025-01-14T13:38:51Z

src/core/algorithms/dc/FastADC/util/approximate_evidence_inverter.h

+            for (auto const& dc : unhit_evi_dcs) {
+                boost::dynamic_bitset<> unhit_cand = dc.cand & evi;
+                if (unhit_cand.any())
+                    new_candidates.Add(DCCandidate(dc.bitset, unhit_cand));


Suggested change

-                    new_candidates.Add(DCCandidate(dc.bitset, unhit_cand));
+                    new_candidates.Add(DCCandidate{dc.bitset, unhit_cand});

Parenthesized aggregate initialization isn't currently supported by Apple Clang. Only the braced form is allowed.
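A minimal illustration of the difference, using a hypothetical aggregate `Candidate` in place of `DCCandidate` (whose real definition is not shown here):

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical aggregate resembling DCCandidate: no user-declared constructors.
struct Candidate {
    int bitset_id;
    int cand_id;
};

static_assert(std::is_aggregate_v<Candidate>);

std::vector<Candidate> MakeCandidates() {
    std::vector<Candidate> result;
    // Braced aggregate initialization: accepted by all mainstream compilers.
    result.push_back(Candidate{1, 2});
    // Candidate(1, 2) would rely on C++20 parenthesized aggregate initialization,
    // which not every compiler implements, so it is avoided here.
    assert(result.back().cand_id == 2);
    return result;
}
```

Note that `emplace_back(1, 2)` on a vector of aggregates depends on the same C++20 feature, so `push_back` with a braced temporary is the more portable choice.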

p-senichenkov · 2025-01-14T18:17:37Z

There are no other problems with Clang.

xJoskiy · 2025-01-12T23:36:29Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+
+#include <easylogging++.h>
+
+#include "builtin.h"


Suggested change

-#include "builtin.h"
+#include "model/types/builtin.h"

xJoskiy · 2025-01-12T23:37:02Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+#include <easylogging++.h>
+
+#include "builtin.h"
+#include "misc.h"


Suggested change

-#include "misc.h"
+#include "algorithms/dc/FastADC/misc/misc.h"

xJoskiy · 2025-01-12T23:37:43Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+
+#include "builtin.h"
+#include "misc.h"
+#include "table/column.h"


Suggested change

-#include "table/column.h"
+#include "model/table/column.h"

xJoskiy · 2025-01-12T23:41:01Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+#include "builtin.h"
+#include "misc.h"
+#include "table/column.h"
+#include "table/typed_column_data.h"


Suggested change

-#include "table/typed_column_data.h"
+#include "model/table/typed_column_data.h"

xJoskiy · 2025-01-12T23:41:41Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+#include "misc.h"
+#include "table/column.h"
+#include "table/typed_column_data.h"
+#include "type.h"


Suggested change

-#include "type.h"
+#include "model/types/type.h"

xJoskiy · 2025-01-14T18:12:05Z

src/core/algorithms/dc/FastADC/model/denial_constraint.h

+        sb << c_not << "{ ";
+        bool first = true;
+        for (PredicatePtr predicate : predicate_set_) {
+            if (!first) {
+                sb << c_and;
+            }
+            sb << predicate->ToString();
+            first = false;
+        }


Suggested change

-        sb << c_not << "{ ";
-        bool first = true;
-        for (PredicatePtr predicate : predicate_set_) {
-            if (!first) {
-                sb << c_and;
-            }
-            sb << predicate->ToString();
-            first = false;
-        }
+        sb << c_not << "{ ";
+        sb << predicate_set_.front().ToString();
+        for (auto it = std::next(predicate_set_.begin()); it != predicate_set_.end(); ++it) {
+            sb << c_and << it->ToString();
+        }

xJoskiy · 2025-01-14T18:15:00Z

src/core/algorithms/dc/FastADC/model/evidence.h

+        size_t pos = 0;
+        while (tmp.any()) {
+            if (tmp.test(0)) {
+                evidence ^= correctionMap[pos];
+            }
+            tmp >>= 1;
+            pos++;
+        }


Suggested change

-        size_t pos = 0;
-        while (tmp.any()) {
-            if (tmp.test(0)) {
-                evidence ^= correctionMap[pos];
-            }
-            tmp >>= 1;
-            pos++;
-        }
+        for (size_t pos = 0; tmp.any(); ++pos) {
+            if (tmp.test(0)) {
+                evidence ^= correctionMap[pos];
+            }
+            tmp >>= 1;
+        }

Magic number btw

xJoskiy · 2025-01-14T18:15:28Z

src/core/algorithms/dc/FastADC/model/evidence_set.h

+
+#include <easylogging++.h>
+
+#include "evidence.h"


Use full name

xJoskiy · 2025-01-14T18:17:35Z

src/core/algorithms/dc/FastADC/model/operator.cpp

+#include "operator.h"
+
+#include <utility>
+
+#include "builtin.h"
+#include "type.h"


Full names also here

xJoskiy · 2025-01-14T18:25:26Z

src/core/algorithms/dc/FastADC/model/operator.cpp

+Operator::OperatorMap<OperatorType> Operator::InitializeInverseMap() {
+    return {{OperatorType::kEqual, OperatorType::kUnequal},
+            {OperatorType::kUnequal, OperatorType::kEqual},
+            {OperatorType::kGreater, OperatorType::kLessEqual},
+            {OperatorType::kLess, OperatorType::kGreaterEqual},
+            {OperatorType::kGreaterEqual, OperatorType::kLess},
+            {OperatorType::kLessEqual, OperatorType::kGreater}};
+}


As a result of a previous discussion #417 (comment), those static fields should be declared as constexpr and frozen::unordered_map should be utilized.

For inspiration look at "algorithms/dc/model/operator.h"
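The compile-time idea behind this request can be sketched with the standard library alone. The version below uses a `constexpr std::array` of pairs as a stand-in for `frozen::unordered_map` (a third-party container that is not reproduced here); `kInverseMap` and `GetInverse` are hypothetical names, and the enum is redeclared only to keep the example self-contained.

```cpp
#include <array>
#include <cassert>
#include <utility>

enum class OperatorType { kEqual, kUnequal, kGreater, kLess, kGreaterEqual, kLessEqual };

// Compile-time table of operator inverses, mirroring InitializeInverseMap above.
inline constexpr std::array<std::pair<OperatorType, OperatorType>, 6> kInverseMap{{
        {OperatorType::kEqual, OperatorType::kUnequal},
        {OperatorType::kUnequal, OperatorType::kEqual},
        {OperatorType::kGreater, OperatorType::kLessEqual},
        {OperatorType::kLess, OperatorType::kGreaterEqual},
        {OperatorType::kGreaterEqual, OperatorType::kLess},
        {OperatorType::kLessEqual, OperatorType::kGreater},
}};

// Linear lookup; with six entries this is cheap and usable in constant expressions.
constexpr OperatorType GetInverse(OperatorType op) {
    for (auto const& [key, value] : kInverseMap) {
        if (key == op) return value;
    }
    assert(false && "every operator has an inverse");
    return op;
}

static_assert(GetInverse(OperatorType::kLess) == OperatorType::kGreaterEqual);
```

Either way, the table is built at compile time instead of in a static initializer function, which avoids initialization-order concerns for static fields.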

xJoskiy · 2025-01-15T07:04:13Z

src/core/algorithms/dc/FastADC/model/predicate.cpp

+
+std::vector<PredicatePtr> const& Predicate::GetImplications(PredicateProvider* provider) const {
+    if (implications_.empty()) {
+        auto op_implications = op_.GetImplications();


Auto type is not obvious

xJoskiy · 2025-01-15T07:04:52Z

src/core/algorithms/dc/FastADC/model/predicate.cpp

+}
+
+std::vector<PredicatePtr> const& Predicate::GetImplications(PredicateProvider* provider) const {
+    if (implications_.empty()) {


Use the same style as above

Suggested change

-    if (implications_.empty()) {
+    if (!implications_) {

xJoskiy · 2025-01-15T07:06:59Z

src/core/algorithms/dc/FastADC/model/predicate.h

+#include "column_operand.h"
+#include "operator.h"
+#include "table/typed_column_data.h"


Full names also here and everywhere else

xJoskiy · 2025-01-15T11:35:56Z

src/core/algorithms/dc/FastADC/model/pli_shard.cpp

+                plis.push_back(
+                        BuildPli(hashed_input[col], input[col].IsNumeric(), shard_beg, shard_end));


Such complex syntax impairs readability. Declare a variable and then move

xJoskiy · 2025-01-15T11:36:31Z

src/core/algorithms/dc/FastADC/model/pli_shard.cpp

+        clusters[cluster_id].push_back(row);
+    }
+
+    return Pli(clusters, keys, key_to_cluster_id);


Suggested change

-    return Pli(clusters, keys, key_to_cluster_id);
+    return {clusters, keys, key_to_cluster_id};

xJoskiy · 2025-01-15T11:56:29Z

src/core/algorithms/dc/FastADC/providers/predicate_provider.h

+    /** Create predicate object and return pointer to it or obtain it from cache */
+    PredicatePtr GetPredicate(Operator const& op, ColumnOperand const& left,
+                              ColumnOperand const& right) {
+        auto [iter, _] = predicates_[op][left].try_emplace(right, op, left, right);
+        return &iter->second;
+    }


Is Predicate really such a complex structure that it's worth caching? Or an algorithm supposes creating a huge amount of predicates?

xJoskiy · 2025-01-15T12:07:02Z

src/core/algorithms/dc/FastADC/util/approximate_evidence_inverter.h

+    boost::dynamic_bitset<> result(lhs);
+    result &= rhs;
+    return result;


Why not simply

Suggested change

-    boost::dynamic_bitset<> result(lhs);
-    result &= rhs;
-    return result;
+    return lhs & rhs;

And then, operator& is already supported for these types

xJoskiy · 2025-01-15T12:13:54Z

src/core/algorithms/dc/FastADC/util/approximate_evidence_inverter.h

+        for (; e < evidences_.size(); ++e) {
+            if (!(IsSubset(dc, evidences_[e].evidence))) {
+                target -= evidences_[e].count;
+                if (target <= 0) return true;
+            }
+        }


Seems a little bit more concise

Suggested change

-        for (; e < evidences_.size(); ++e) {
-            if (!(IsSubset(dc, evidences_[e].evidence))) {
-                target -= evidences_[e].count;
-                if (target <= 0) return true;
-            }
-        }
+        for (; e < evidences_.size(); ++e) {
+            if (IsSubset(dc, evidences_[e].evidence)) continue;
+            target -= evidences_[e].count;
+            if (target <= 0) return true;
+        }

xJoskiy · 2025-01-15T12:21:41Z

src/core/algorithms/dc/FastADC/util/closure.h

+    bool TransitivityStep() {
+        std::unordered_set<PredicatePtr> additions;
+
+        // Add implications and symmetric implications to additions
+        std::for_each(closure_.begin(), closure_.end(), [&](PredicatePtr p) {
+            if (p->GetSymmetric(provider_) != nullptr) {
+                auto const& sym_implications =
+                        p->GetSymmetric(provider_)->GetImplications(provider_);
+                additions.insert(sym_implications.begin(), sym_implications.end());
+            }
+            auto const& implications = p->GetImplications(provider_);
+            additions.insert(implications.begin(), implications.end());
+        });
+
+        for (auto const& [op, list] : grouped_) {
+            for (Operator op_trans : op.GetTransitives()) {
+                auto const p_trans_it = grouped_.find(op_trans);
+                if (p_trans_it == grouped_.end()) continue;
+
+                std::vector<PredicatePtr> const& p_trans = p_trans_it->second;
+
+                for (PredicatePtr p : list) {
+                    for (PredicatePtr p2 : p_trans) {
+                        if (p == p2) continue;
+
+                        // Transitive inference: A -> B; B -> C
+                        if (p->GetRightOperand() == p2->GetLeftOperand()) {
+                            PredicatePtr new_pred = provider_->GetPredicate(op, p->GetLeftOperand(),
+                                                                            p2->GetRightOperand());
+                            additions.insert(new_pred);
+                        }
+
+                        // Transitive inference: C -> A; A -> B
+                        if (p2->GetRightOperand() == p->GetLeftOperand()) {
+                            PredicatePtr new_pred = provider_->GetPredicate(
+                                    op, p2->GetLeftOperand(), p->GetRightOperand());
+                            additions.insert(new_pred);
+                        }
+                    }
+                }
+            }
+        }
+
+        // Handle special cases for operators
+        auto const& uneq_list_it = grouped_.find(OperatorType::kUnequal);
+        if (uneq_list_it != grouped_.end()) {
+            for (PredicatePtr p : uneq_list_it->second) {
+                if (closure_.Contains(provider_->GetPredicate(
+                            OperatorType::kLessEqual, p->GetLeftOperand(), p->GetRightOperand()))) {
+                    additions.insert(provider_->GetPredicate(
+                            OperatorType::kLess, p->GetLeftOperand(), p->GetRightOperand()));
+                }
+                if (closure_.Contains(provider_->GetPredicate(OperatorType::kGreaterEqual,
+                                                              p->GetLeftOperand(),
+                                                              p->GetRightOperand()))) {
+                    additions.insert(provider_->GetPredicate(
+                            OperatorType::kGreater, p->GetLeftOperand(), p->GetRightOperand()));
+                }
+            }
+        }
+
+        auto const& leq_list_it = grouped_.find(OperatorType::kLessEqual);
+        if (leq_list_it != grouped_.end()) {
+            for (PredicatePtr p : leq_list_it->second) {
+                if (closure_.Contains(provider_->GetPredicate(OperatorType::kGreaterEqual,
+                                                              p->GetLeftOperand(),
+                                                              p->GetRightOperand()))) {
+                    additions.insert(provider_->GetPredicate(
+                            OperatorType::kEqual, p->GetLeftOperand(), p->GetRightOperand()));
+                }
+            }
+        }
+
+        // Add all newly inferred predicates
+        return AddAll(additions);
+    }


Seems too complex, definitely should be split into several functions

xJoskiy · 2025-01-15T12:24:09Z

src/core/algorithms/dc/FastADC/util/cross_clue_set_builder.cpp

+    auto const& pivot_keys = pivotPli.GetKeys();
+
+    for (size_t i = 0; i < pivot_keys.size(); ++i) {
+        size_t j;


Explicit initialization

xJoskiy · 2025-01-15T12:38:02Z

src/core/algorithms/dc/FastADC/util/single_clue_set_builder.cpp

+    auto const& pivot_clusters = pivotPli.GetClusters();
+    auto const& probe_clusters = probePli.GetClusters();
+    auto const& pivot_keys = pivotPli.GetKeys();


Not obvious auto type

xJoskiy · 2025-01-15T12:38:18Z

src/core/algorithms/dc/FastADC/util/single_clue_set_builder.cpp

+    auto const& pivot_keys = pivotPli.GetKeys();
+
+    for (size_t i = 0; i < pivot_keys.size(); ++i) {
+        size_t j;


Explicit initialization

xJoskiy · 2025-01-15T12:38:41Z

src/core/algorithms/dc/FastADC/util/single_clue_set_builder.cpp

+    auto const& pivot_keys = pivotPli.GetKeys();
+    auto const& probe_keys = probePli.GetKeys();


Not obvious auto type

xJoskiy · 2025-01-15T12:41:03Z

src/python_bindings/dc/bind_fastadc.cpp

+namespace {
+namespace py = pybind11;
+}  // namespace


Why anonymous namespace?

xJoskiy · 2025-01-15T12:42:10Z

src/core/config/names.h

@@ -1,5 +1,6 @@
 #pragma once

+#include "descriptions.h"


Is it used anywhere?

xJoskiy · 2025-01-19T10:40:30Z

src/core/algorithms/dc/FastADC/fastadc.cpp

+#include "descriptions.h"
+#include "names.h"


Suggested change

-#include "descriptions.h"
-#include "names.h"
+#include "config/names_and_descriptions.h"

xJoskiy · 2025-01-19T13:18:33Z

src/core/algorithms/dc/FastADC/fastadc.cpp

+                    "\" is of unsupported type. Only numeric and string types are supported.");
+        }
+
+        for (std::size_t row_index = 0; row_index < rows_num; row_index++) {


Use prefix increment
https://google.github.io/styleguide/cppguide.html#Preincrement_and_Predecrement:~:text=Use%20prefix%20increment/decrement%2C%20unless%20the%20code%20explicitly%20needs%20the%20result%20of%20the%20postfix%20increment/decrement%20expression.

Suggested change

-        for (std::size_t row_index = 0; row_index < rows_num; row_index++) {
+        for (std::size_t row_index = 0; row_index < rows_num; ++row_index) {

xJoskiy · 2025-01-19T13:22:07Z

src/core/algorithms/dc/FastADC/fastadc.cpp

+            if (column.IsNull(row_index)) {
+                throw std::runtime_error("Some of the value coordinates are nulls.");
+            }
+            if (column.IsEmpty(row_index)) {
+                throw std::runtime_error("Some of the value coordinates are empty.");
+            }


If it's not important you may combine with column.IsNullOrEmpty()

Suggested change

-            if (column.IsNull(row_index)) {
-                throw std::runtime_error("Some of the value coordinates are nulls.");
-            }
-            if (column.IsEmpty(row_index)) {
-                throw std::runtime_error("Some of the value coordinates are empty.");
-            }
+            if (column.IsNullOrEmpty(row_index)) {
+                throw std::runtime_error("Some of the value coordinates are null or empty.");
+            }

xJoskiy · 2025-01-19T13:26:37Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+            break;
+        default:
+            LOG(DEBUG) << "Column type  " << c1.GetType().ToString() << " is not numeric";
+            return -1;


Suggested change

-            return -1;
+            return -1.0;

xJoskiy · 2025-01-19T13:29:53Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+}
+
+double GetSharedPercentage(model::TypedColumnData const& c1, model::TypedColumnData const& c2) {
+    if (c1.GetColumn() == c2.GetColumn()) return 1.;


Suggested change

-    if (c1.GetColumn() == c2.GetColumn()) return 1.;
+    if (c1.GetColumn() == c2.GetColumn()) return 1.0;

xJoskiy · 2025-01-19T13:30:11Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+            LOG(DEBUG) << "Column " << c1.GetColumn()->ToString() << " with type "
+                       << c1.GetType().ToString()
+                       << " is not supported for shared percentage calculation";
+            return -1;


Suggested change

-            return -1;
+            return -1.0;

xJoskiy · 2025-01-19T13:30:34Z

src/core/algorithms/dc/FastADC/misc/typed_column_data_value_differences.cpp

+}
+
+double GetAverageRatio(model::TypedColumnData const& c1, model::TypedColumnData const& c2) {
+    if (c1.GetColumn() == c2.GetColumn()) return 1.;


Suggested change

-    if (c1.GetColumn() == c2.GetColumn()) return 1.;
+    if (c1.GetColumn() == c2.GetColumn()) return 1.0;

xJoskiy · 2025-01-19T13:36:50Z

src/core/algorithms/dc/FastADC/util/denial_constraint_set.h

+    static int CompareBitsets(boost::dynamic_bitset<> const& lhs,
+                              boost::dynamic_bitset<> const& rhs) {
+        size_t max_size = std::max(lhs.size(), rhs.size());
+
+        for (size_t i = 0; i < max_size; ++i) {
+            bool lhs_bit = i < lhs.size() ? lhs[i] : false;
+            bool rhs_bit = i < rhs.size() ? rhs[i] : false;
+
+            if (lhs_bit != rhs_bit) {
+                return lhs_bit ? 1 : -1;  // lhs_bit is true and rhs_bit is false => lhs > rhs
+            }
+        }
+        return 0;  // bitsets are equal
+    }


I think it is worth using CompareResult enum defined in model/types/builtin.h
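A sketch of what an enum-returning comparison could look like. The `CompareResult` enum declared here is a local, hypothetical stand-in (the real one in model/types/builtin.h may have different enumerators), and `std::vector<bool>` replaces `boost::dynamic_bitset<>` to keep the example self-contained.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical three-way result type; the project's CompareResult may differ.
enum class CompareResult { kLess, kEqual, kGreater };

// Compares bit by bit from index 0, treating missing bits as unset.
CompareResult CompareBitsets(std::vector<bool> const& lhs, std::vector<bool> const& rhs) {
    std::size_t const max_size = std::max(lhs.size(), rhs.size());
    for (std::size_t i = 0; i < max_size; ++i) {
        bool const lhs_bit = i < lhs.size() && lhs[i];
        bool const rhs_bit = i < rhs.size() && rhs[i];
        if (lhs_bit != rhs_bit) {
            // lhs_bit set and rhs_bit unset means lhs > rhs.
            return lhs_bit ? CompareResult::kGreater : CompareResult::kLess;
        }
    }
    return CompareResult::kEqual;
}
```

Compared with returning `1`, `-1` or `0`, the named enumerators make call sites self-explanatory and rule out accidental arithmetic on the result.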

xJoskiy · 2025-01-19T16:14:14Z

src/core/algorithms/dc/FastADC/model/predicate_set.cpp

+    std::string result = "{ ";
+    for (PredicatePtr predicate : *this) {
+        result += predicate->ToString() + " ";
+    }
+    result += "}";
+    return result;


Suggested change

-    std::string result = "{ ";
-    for (PredicatePtr predicate : *this) {
-        result += predicate->ToString() + " ";
-    }
-    result += "}";
-    return result;
+    std::stringstream ss;
+    ss << "{ ";
+    for (PredicatePtr predicate : *this) {
+        ss << predicate->ToString() << ' ';
+    }
+    ss << '}';
+    return ss.str();

xJoskiy · 2025-01-19T16:25:05Z

src/core/algorithms/dc/FastADC/model/denial_constraint.h

+#include <sstream>
+#include <string>
+
+#include "predicate_set.h"


Use full name

xJoskiy · 2025-01-19T16:26:36Z

src/core/algorithms/dc/FastADC/model/denial_constraint.h

+
+#include <sstream>
+#include <string>
+


IWYU, add dynamic_bitset header and predicate.h

xJoskiy · 2025-01-19T16:28:27Z

src/core/algorithms/dc/FastADC/model/predicate_set.h

+#include <boost/move/utility_core.hpp>
+
+#include "dc/FastADC/providers/index_provider.h"
+#include "predicate.h"


xJoskiy · 2025-01-19T16:43:26Z

src/core/algorithms/dc/FastADC/util/approximate_evidence_inverter.h

+        std::sort(evidences_.begin(), evidences_.end(),
+                  [](Evidence const& o1, Evidence const& o2) { return o2.count < o1.count; });


Declare a lambda to improve readability, and change the order of operands in the comparison so it's clear that the sequence is sorted in descending order.

Suggested change

-        std::sort(evidences_.begin(), evidences_.end(),
-                  [](Evidence const& o1, Evidence const& o2) { return o2.count < o1.count; });
+        auto cmp = [](Evidence const& o1, Evidence const& o2) { return o1.count > o2.count; };
+        std::sort(evidences_.begin(), evidences_.end(), cmp);

xJoskiy · 2025-01-19T17:24:06Z

src/core/algorithms/dc/FastADC/util/ntree_search.h

+public:
+    bool Add(boost::dynamic_bitset<> const& bs) {
+        Add(bs, bs.find_first());
+        return true;


Is it necessary if it always returns true?

xJoskiy · 2025-01-19T17:25:44Z

src/core/algorithms/dc/FastADC/util/ntree_search.h

+                boost::dynamic_bitset<> const* res =
+                        it->second->GetSubset(add, add.find_next(next_bit));


Create a variable for the returned value; it will improve readability

xJoskiy · 2025-01-19T17:28:37Z

src/core/algorithms/dc/FastADC/util/predicate_builder.cpp

+        AddAndCategorizePredicate(ColumnOperand(input[i].GetColumn(), ColumnOperandTuple::t),
+                                  ColumnOperand(input[j].GetColumn(), ColumnOperandTuple::s),
+                                  comparable);


Suggested change

-        AddAndCategorizePredicate(ColumnOperand(input[i].GetColumn(), ColumnOperandTuple::t),
-                                  ColumnOperand(input[j].GetColumn(), ColumnOperandTuple::s),
-                                  comparable);
+        auto t_col_op = ColumnOperand(input[i].GetColumn(), ColumnOperandTuple::t);
+        auto s_col_op = ColumnOperand(input[j].GetColumn(), ColumnOperandTuple::s);
+        AddAndCategorizePredicate(t_col_op, s_col_op, comparable);

xJoskiy · 2025-01-19T17:31:10Z

src/core/algorithms/dc/FastADC/util/predicate_organizer.h

+        std::stable_sort(indexes.begin(), indexes.end(),
+                         [&coverages](int i, int j) { return coverages[i] < coverages[j]; });


Suggested change

-        std::stable_sort(indexes.begin(), indexes.end(),
-                         [&coverages](int i, int j) { return coverages[i] < coverages[j]; });
+        auto cmp = [&coverages](int i, int j) { return coverages[i] < coverages[j]; };
+        std::stable_sort(indexes.begin(), indexes.end(), cmp);

xJoskiy · 2025-01-19T17:32:07Z

src/core/algorithms/dc/FastADC/util/single_clue_set_builder.cpp

+void SingleClueSetBuilder::CorrectNumSingle(std::vector<Clue>& clues, Pli const& pli,
+                                            Clue const& eqMask, Clue const& gtMask) {
+    for (size_t i = 0; i < pli.Size(); ++i) {
+        auto const& cluster = pli.Get(i);


Not obvious auto type

ol-imorozko force-pushed the FastADC branch 4 times, most recently from f7a5ca2 to 3cc3060 Compare October 5, 2024 17:46

github-actions bot reviewed Oct 5, 2024

src/tests/test_dc_structures_correct_results.h

polyntsov requested changes Nov 15, 2024

ol-imorozko force-pushed the FastADC branch 2 times, most recently from 928755d to f3a8b73 Compare December 2, 2024 20:14

ol-imorozko added 14 commits January 13, 2025 21:27

Replace int with int64_t in Predicate class

451c76f

TypedColumnData kInt type is int64_t, and FastADC algorithm uses 64-bit long types

Initial commit that adds dc folder and placeholder for dc.h

82f8437

FastADC algorithm for mining approximate Denial Constraints will be implemented here.

Implement method to get value from TypedColumnData

3c50e82

Implement PredicateBuilder class

9e057cf

Generates and categorizes predicates for the future evidence set construction

Implement CommonClueSetBuilder

7753aed

This class organizes predicates into packs and creates a correction map, which will be used for optimizing predicate comparisons in derived classes that will actually build clues from PLIs

Implement SingleClueSetBuilder

effa8ba

Inherits from CommonClueSetBuilder and builds clues from one PLI shard

Implement CrossClueSetBuilder

c99e02c

Inherits from CommonClueSetBuilder and builds clues from two PLI shards

Add test that checks static fields of CommonClueSetBuilder

623ab63

Validates the number of bits in the clue, the structure of the predicate packs, and the correction map, which stores predicate-to-bitset mappings

Implement ClueSetBuilder

7245877

This is a class for constructing clues from PliShards.

Add test that checks ClueSet building

eded4d8

The expected values are, once again, taken from the Java implementation

ol-imorozko force-pushed the FastADC branch from f3a8b73 to 919ae85 Compare January 13, 2025 18:27

github-actions bot reviewed Jan 13, 2025

src/core/algorithms/dc/FastADC/model/pli_shard.h

ol-imorozko added 6 commits January 14, 2025 01:12

Add an ability to force kString type on TypedColumnData instead of kM…

e1ed44c

…ixed

Add initial EvidenceSetBuilder class that builds cardinality mask

8d5edc1

For now this class builds necessary structures to build Evidences later. The structures are clue set, correction map and cardinality mask.

Add test that verifies CardinalityMask

3943406

Implement Evidence

fbb6a86

This is a class that maps one-to-one with Clue. The ApproximateEvidenceInversion algorithm (AEI), which will build approximate denial constraints, uses Evidences as its input

Implement EvidenceSet

fc63418

EvidenceSet is basically just a vector of evidences. The only thing it adds is a method to get the total count (I probably can publicly inherit from std::vector<Evidence>...?)

Implement EvidenceSetBuilder

37e1136

Add the building of evidences

ol-imorozko added 10 commits January 14, 2025 03:22

Do not explicitly delete PredicateSet constructor

9d8d389

Make GetIndex in IndexProvider accept references

8ce8fab

Add inline to operator& and &= in aei.h to avoid violating ODR

c02e8ab

Don't capture all variables by references in aei.h

f98e9e1

Remove shared_ptrs in aei.h and simplify Hit method

36dfea9

Get rid of try-catch in clue builder by adding a helper function to c…

d0c362d

…heck first

Split variable definitions into several lines

26bfbd6

Add missing braces around if-else

0244e4d

Move BuildClueSet implementation to cpp file

2ba381e

Make CompareBitsets a static method of DenialConstraintSet

12b2881

ol-imorozko force-pushed the FastADC branch from 346eedc to e0afe8b Compare January 14, 2025 00:22

ol-imorozko added 4 commits January 14, 2025 14:57

Implement FastADC algorithm as a class derived from Algorithm

11aedfe

Add FastADC python bindings

88a7b1f

Correct clang-format issues

cdd22ed

Bring back missing tmp_dc.cvs and rename it appropriately

5e9835e

ol-imorozko force-pushed the FastADC branch from e0afe8b to 5e9835e Compare January 14, 2025 12:03

ol-imorozko changed the title ~~WIP: FastADC implementation~~ FastADC implementation Jan 14, 2025

p-senichenkov reviewed Jan 14, 2025

Add example of ADC mining

9f2a1d5

xJoskiy suggested changes Jan 14, 2025

xJoskiy reviewed Jan 15, 2025

xJoskiy suggested changes Jan 15, 2025

xJoskiy suggested changes Jan 19, 2025