Implement set based aod verifier, support aod mining in fastod #468

Open
wants to merge 25 commits into
base: main
25 commits
4267c23
Add ColumnIndexOption and move there ValidateIndex
polyntsov Sep 28, 2024
31f3c47
Add parameter to allow empty list of indices in IndicesOption
polyntsov Sep 28, 2024
1e76762
Move complex stripeed partition swap and create definitions to cpp
polyntsov Sep 28, 2024
20592c8
Refactor Swap in complex stripped partition
polyntsov Sep 28, 2024
4d65979
Introduce od::Ordering enum and use it instead of bool Ascending
polyntsov Sep 28, 2024
f043b22
Introduce partition type for complex stripped partition as create param
polyntsov Sep 28, 2024
b452c28
Accept in CreateAttributeSet any range as list of attributes
polyntsov Sep 29, 2024
789c617
Store DataFrame in ComplexStrippedPartition as raw pointer
polyntsov Sep 29, 2024
0214e5d
Store DataFrame directly as value in Fastod
polyntsov Sep 29, 2024
eef5077
Move nd_verifier's VectorToString to general util and accept any range
polyntsov Sep 29, 2024
7cff09d
Add missing <vector> include to config/iption.h
polyntsov Sep 30, 2024
cae91d2
Add method to convert fastod::AttributeSet to vector of column indices
polyntsov Sep 30, 2024
187a778
Add a callback to Option which is called before the option is set
polyntsov Sep 30, 2024
1f901a6
Implement getters for context and cols in canonical ods
polyntsov Sep 30, 2024
cdb4980
Implement a function to load algo data without configuring execute opts
polyntsov Sep 30, 2024
8b24e2f
Introduce is required callback to option
polyntsov Sep 30, 2024
d938c38
Allow absence of non-required options in algo factory
polyntsov Sep 30, 2024
fa7af04
Implement aod verifier and cover it with tests
polyntsov Sep 30, 2024
7d04aea
Implement python bindings to set based aod verifier
polyntsov Sep 30, 2024
60c2927
Implement error parameter for fastod and add tests for aod mining
polyntsov Sep 30, 2024
df33951
Specify in readme that we now support approximate set-based ODs
polyntsov Sep 30, 2024
4c350d0
Implement aod verification python example
polyntsov Sep 30, 2024
56a0f31
Avoid unnecessary copying of partitions in fastod partition cache
polyntsov Sep 30, 2024
a0e7af5
Don't store indices vectors via shared ptr in fastod complex partition
polyntsov Sep 30, 2024
3f6ec49
Fallback to split and swap validation when error is zero in canonical od
polyntsov Oct 1, 2024
2 changes: 1 addition & 1 deletion README.md
@@ -25,7 +25,7 @@ The currently supported data patterns are:
* Conditional functional dependencies (discovery)
* Inclusion dependencies (discovery)
* Order dependencies:
- set-based axiomatization (discovery)
- set-based axiomatization (discovery and validation including approximate)
- list-based axiomatization (discovery)
* Metric functional dependencies (validation)
* Fuzzy algebraic constraints (discovery)
2 changes: 1 addition & 1 deletion README_PYPI.md
@@ -43,7 +43,7 @@ The currently supported data patterns are:
* Conditional functional dependencies (discovery)
* Inclusion dependencies (discovery)
* Order dependencies:
- set-based axiomatization (discovery)
- set-based axiomatization (discovery and validation including approximate)
- list-based axiomatization (discovery)
* Metric functional dependencies (validation)
* Fuzzy algebraic constraints (discovery)
78 changes: 78 additions & 0 deletions examples/basic/verifying_aod.py
@@ -0,0 +1,78 @@
import desbordante
import pandas as pd
from tabulate import tabulate
import textwrap

def prints(str):
    print(textwrap.fill(str, 80))

def print_data_frame(data_frame, title = None):
    print_table(data_frame, 'keys', title)

def print_table(table, headers = None, title = None):
    if title is not None:
        print(title)

    print(tabulate(table, headers=headers, tablefmt='psql'))


table = pd.read_csv('examples/datasets/salary.csv')
algo = desbordante.aod_verification.algorithms.Default()
algo.load_data(table=table)

prints("This example verifies set-based ODs.")
prints("""Please take a look at set-based ODs mining example first
(examples/basic/mining_set_od1.py).""")
print()
print_data_frame(table)
print()
prints("Let's start by verifying exact OC holding on the table above.")
prints("""One example of such OC is `{1} : 2<= ~ 3<=` (if you don't understand why it
holds, please take a look at examples/basic/mining_set_od1.py).""")
print()

# Indices are zero-based, this is why we're subtracting one
algo.execute(oc_context=[0], oc_left_index=1, oc_right_index=2, left_ordering='ascending')
prints(f"""OC {{1}}: 2<= ~ 3<= holds exactly: {algo.holds()}, removal set: {algo.get_removal_set()},
error: {algo.get_error()}""")
prints("""Note that error is zero and removal set is empty. Removal set is a set of rows which
should be removed in order for OC (or OD) to hold exactly. In this case OC holds exactly and
that's why the set is empty.""")

print()

prints("Now let's verify OFD {2} : [] -> 1<= which also holds exactly.")
print()
algo.execute(ofd_context=[1], ofd_right_index=0)
prints(f"""OFD {{2}}: [] -> 1<= holds exactly: {algo.holds()}, removal set: {algo.get_removal_set()},
error: {algo.get_error()}""")
prints("Note once again that error is zero and removal set is empty because OFD holds exactly.")

print()
print("Now let's add some lines to the table to break exact holding of dependencies.")
table.loc[8] = [2020, 50, 9000]
print_data_frame(table)

# Need to recreate the algo object because calling load_data() twice is not supported yet
algo = desbordante.aod_verification.algorithms.Default()
algo.load_data(table=table)
algo.execute(oc_context=[0], oc_left_index=1, oc_right_index=2, left_ordering='ascending')
prints(f"""OC {{1}}: 2<= ~ 3<= holds exactly: {algo.holds()}, removal set: {algo.get_removal_set()},
error: {algo.get_error()}""")
prints("""Note that now OC doesn't hold exactly and that removal set is {4}. This means that
in order for OC to hold exactly, it's enough to remove from the table line number 4 (indexed from 0).
Note that lines 8 and 4 are interchangeable in that sense, because the problem with ordering is
caused by their simultaneous presence in the table and removing either of them will fix it. The algorithm
is guaranteed to return a minimal removal set in terms of size, but doesn't specify which one exactly
if there are several possible.""")

print()
algo.execute(ofd_context=[1], ofd_right_index=0)
prints(f"""OFD {{2}}: [] -> 1<= holds exactly: {algo.holds()}, removal set: {algo.get_removal_set()},
error: {algo.get_error()}""")
prints("""Note once again that the OFD does not hold exactly anymore and that removal set is not
empty. By adding line 8 with the same value in column 2 as in line 5, but a different value in column
1 we broke FD 2->1 and thus broke OFD {2}: [] -> 1<=. Removing either of these two lines will make the
OFD hold exactly, thus removal set is {5}.
""")

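The removal-set idea described in the example above can be sketched in plain Python, independently of Desbordante. This is only a sketch under stated assumptions: the helper name `ofd_removal_set`, the toy rows, and the error definition (removed rows divided by total rows) are illustrative and are not taken from the library. It relies on the reading used in the example, where an OFD with an empty left side behaves like an FD from the context to the right column. The right column must therefore be constant within every group of rows that agree on the context, so one minimal removal set keeps the most frequent value in each group and drops the rest.

```python
from collections import Counter, defaultdict


def ofd_removal_set(rows, context, right):
    """Return a minimal set of row indices whose removal makes column `right`
    constant within every group of rows that agree on the `context` columns."""
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        key = tuple(row[c] for c in context)
        groups[key].append(idx)

    removal = set()
    for indices in groups.values():
        counts = Counter(rows[i][right] for i in indices)
        keep_value, _ = counts.most_common(1)[0]  # most frequent value stays
        removal.update(i for i in indices if rows[i][right] != keep_value)
    return removal


# Hypothetical toy table with three columns; this is NOT the salary.csv dataset.
rows = [
    (2020, 10, 5000),
    (2020, 20, 6000),
    (2021, 30, 7000),
    (2022, 20, 8000),  # agrees with row 1 on column 1, differs on column 0
]

removal = ofd_removal_set(rows, context=[1], right=0)
error = len(removal) / len(rows)  # assumed error definition, for illustration
print(removal, error)
```

As in the example's discussion of interchangeable lines, rows 1 and 3 conflict only with each other, so removing either one is enough; the sketch returns just one of the possible minimal sets.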
6 changes: 5 additions & 1 deletion src/core/algorithms/algo_factory.cpp
@@ -43,7 +43,7 @@ void ConfigureFromMap(Algorithm& algorithm, StdParamsMap const& options) {
});
}

void LoadAlgorithm(Algorithm& algorithm, StdParamsMap const& options) {
void LoadAlgorithmData(Algorithm& algorithm, StdParamsMap const& options) {
ConfigureFromFunction(algorithm, [&options](std::string_view option_name) {
using namespace config::names;
auto create_input_table = [](CSVConfig const& csv_config) -> config::InputTable {
@@ -69,6 +69,10 @@ void LoadAlgorithm(Algorithm& algorithm, StdParamsMap const& options) {
return GetOrEmpty(options, option_name);
});
algorithm.LoadData();
}

void LoadAlgorithm(Algorithm& algorithm, StdParamsMap const& options) {
LoadAlgorithmData(algorithm, options);
ConfigureFromMap(algorithm, options);
}

16 changes: 15 additions & 1 deletion src/core/algorithms/algo_factory.h
@@ -18,13 +18,27 @@ template <typename FuncType>
void ConfigureFromFunction(Algorithm& algorithm, FuncType get_opt_value_by_name) {
std::unordered_set<std::string_view> needed;
while (!(needed = algorithm.GetNeededOptions()).empty()) {
std::vector<std::string_view> needed_but_empty;
for (std::string_view option_name : needed) {
algorithm.SetOption(option_name, get_opt_value_by_name(option_name));
boost::any value = get_opt_value_by_name(option_name);
if (value.empty()) {
needed_but_empty.push_back(option_name);
continue;
}
algorithm.SetOption(option_name, value);
Comment on lines +24 to +28
Collaborator
Suggested change
if (value.empty()) {
needed_but_empty.push_back(option_name);
continue;
}
algorithm.SetOption(option_name, value);
if (value.empty())
needed_but_empty.push_back(option_name);
else
algorithm.SetOption(option_name, value);

}

// After we set some other options these options may become non-required
for (std::string_view option_name : needed_but_empty) {
if (algorithm.OptionIsRequired(option_name)) {
algorithm.SetOption(option_name, boost::any{});
}
}
}
}
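The control flow of the updated configuration loop can be illustrated with a small Python model. This is a sketch, not the project's code: `ToyOption`, `ToyAlgorithm`, and the relationship between the two option names are hypothetical, and `None` stands in for an empty `boost::any`. It shows the two passes in the loop: options with a supplied value are set first, and options without one are revisited afterwards and only forced (default or error) if they are still required.

```python
class ToyOption:
    """Stand-in for an option; None plays the role of an empty boost::any."""

    def __init__(self, default=None, is_required=lambda: True):
        self.default = default
        self.is_required = is_required  # callback, re-evaluated on every check
        self.value = None
        self.is_set = False


class ToyAlgorithm:
    def __init__(self, options):
        self.options = options  # name -> ToyOption

    def get_needed_options(self):
        # Only unset options that are currently required count as needed.
        return {name for name, opt in self.options.items()
                if not opt.is_set and opt.is_required()}

    def option_is_required(self, name):
        opt = self.options.get(name)
        return opt is not None and opt.is_required()

    def set_option(self, name, value=None):
        opt = self.options[name]
        if value is None:
            if opt.default is None:
                raise ValueError(f"no value and no default for option {name!r}")
            value = opt.default
        opt.value = value
        opt.is_set = True


def configure_from_function(algorithm, get_value):
    while needed := algorithm.get_needed_options():
        needed_but_empty = []
        for name in needed:
            value = get_value(name)
            if value is None:
                needed_but_empty.append(name)  # decide after the others are set
            else:
                algorithm.set_option(name, value)
        # Options set above may have made some of these non-required.
        for name in needed_but_empty:
            if algorithm.option_is_required(name):
                algorithm.set_option(name)  # use the default or raise


# Hypothetical dependency: oc_context stops being required once ofd_context is set.
options = {"ofd_context": ToyOption()}
options["oc_context"] = ToyOption(
    is_required=lambda: not options["ofd_context"].is_set)
algo = ToyAlgorithm(options)
configure_from_function(algo, {"ofd_context": [1]}.get)
print(options["ofd_context"].value, options["oc_context"].is_set)
```

In this model the caller supplies only `ofd_context`; `oc_context` is skipped rather than rejected because its requirement callback is re-evaluated after the first pass.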

void ConfigureFromMap(Algorithm& algorithm, StdParamsMap const& options);
void LoadAlgorithmData(Algorithm& algorithm, StdParamsMap const& options);
void LoadAlgorithm(Algorithm& algorithm, StdParamsMap const& options);

template <typename T>
31 changes: 28 additions & 3 deletions src/core/algorithms/algorithm.cpp
@@ -1,5 +1,6 @@
#include "algorithms/algorithm.h"

#include <algorithm>
#include <cassert>

#include "config/exceptions.h"
@@ -11,6 +12,10 @@ bool Algorithm::SetExternalOption([[maybe_unused]] std::string_view option_name,
return false;
}

bool Algorithm::ExternalOptionIsRequired([[maybe_unused]] std::string_view option_name) const {
return false;
}

void Algorithm::AddSpecificNeededOptions(
[[maybe_unused]] std::unordered_set<std::string_view>& previous_options) const {}

@@ -60,8 +65,15 @@ void Algorithm::MakeOptionsAvailable(std::vector<std::string_view> const& option
}
}

bool Algorithm::AllRequiredOptionsAreSet() const noexcept {
std::unordered_set<std::string_view> needed = GetNeededOptions();
return std::none_of(needed.begin(), needed.end(), [this](std::string_view option_name) {
return possible_options_.at(option_name)->IsRequired();
});
}
Comment on lines +68 to +73
Collaborator
Maybe the following code will be the same, but when I look at the current implementation I don't like that the All...AreSet method uses none_of instead of all_of.

Shouldn't we do all_of over available_options_ here and return opt->IsRequired() && opt->IsSet()?
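The two formulations discussed in this comment can be compared on a toy model. The sketch below is an assumption-laden illustration in Python (options are plain dicts with `required` and `set` flags, and the function names are made up); it enumerates every state of two options and checks where the predicates agree. Under this model, the `none_of` version matches an `all_of` whose predicate is "not required, or set", while the predicate "required and set" also rejects optional options and so gives different answers.

```python
from itertools import product


def none_of_version(options):
    # Mirrors the diff: collect unset required options, then check none is required.
    needed = [o for o in options if not o["set"] and o["required"]]
    return not any(o["required"] for o in needed)


def all_of_required_and_set(options):
    # The predicate as literally written in the comment.
    return all(o["required"] and o["set"] for o in options)


def all_of_implication(options):
    # "If an option is required, it must be set."
    return all((not o["required"]) or o["set"] for o in options)


states = [{"required": r, "set": s}
          for r, s in product([False, True], repeat=2)]
pairs = list(product(states, repeat=2))

agree_with_implication = all(
    none_of_version(p) == all_of_implication(p) for p in pairs)
disagreements = [
    p for p in pairs if none_of_version(p) != all_of_required_and_set(p)]
print(agree_with_implication, len(disagreements))
```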


void Algorithm::LoadData() {
if (!GetNeededOptions().empty())
if (!AllRequiredOptionsAreSet())
throw std::logic_error("All options need to be set before starting processing.");
LoadDataInternal();
ExecutePrepare();
@@ -71,7 +83,7 @@ unsigned long long Algorithm::Execute() {
if (!data_loaded_) {
throw std::logic_error("Data must be processed before execution.");
}
if (!GetNeededOptions().empty())
if (!AllRequiredOptionsAreSet())
throw std::logic_error("All options need to be set before execution.");
progress_.ResetProgress();
ResetState();
@@ -112,10 +124,23 @@ void Algorithm::SetOption(std::string_view option_name, boost::any const& value)
child_opts.insert(child_opts.end(), new_opts.begin(), new_opts.end());
}

bool Algorithm::OptionIsRequired(std::string_view option_name) const {
if (bool ext_opt_is_required = ExternalOptionIsRequired(option_name); ext_opt_is_required) {
return true;
}
Comment on lines +128 to +130
Collaborator
Why not just

Suggested change
if (bool ext_opt_is_required = ExternalOptionIsRequired(option_name); ext_opt_is_required) {
return true;
}
if (ExternalOptionIsRequired(option_name)) {
return true;
}


auto it = possible_options_.find(option_name);
if (it == possible_options_.end()) {
return false;
}
return it->second->IsRequired();
Comment on lines +133 to +136
Collaborator
maybe just

Suggested change
if (it == possible_options_.end()) {
return false;
}
return it->second->IsRequired();
return it != possible_options_.end() && it->second->IsRequired();

}

std::unordered_set<std::string_view> Algorithm::GetNeededOptions() const {
std::unordered_set<std::string_view> needed{};
for (std::string_view name : available_options_) {
if (!possible_options_.at(name)->IsSet()) {
if (std::unique_ptr<config::IOption> const& opt = possible_options_.at(name);
!opt->IsSet() && opt->IsRequired()) {
needed.insert(name);
}
}
3 changes: 3 additions & 0 deletions src/core/algorithms/algorithm.h
Collaborator
Introduce is required callback to option: change the commit title to something like "Introduce IsRequired callback to option".

@@ -38,6 +38,7 @@ class Algorithm {
void ClearOptions() noexcept;
virtual void LoadDataInternal() = 0;
virtual unsigned long long ExecuteInternal() = 0;
bool AllRequiredOptionsAreSet() const noexcept;

protected:
void AddProgress(double val) noexcept {
@@ -64,6 +65,7 @@ class Algorithm {
// Overload this if you want to work with options outside of
// possible_options_ map. Useful for pipelines.
virtual bool SetExternalOption(std::string_view option_name, boost::any const& value);
virtual bool ExternalOptionIsRequired(std::string_view option_name) const;
virtual void AddSpecificNeededOptions(
std::unordered_set<std::string_view>& previous_options) const;
void ExecutePrepare();
@@ -90,6 +92,7 @@ class Algorithm {
unsigned long long Execute();

void SetOption(std::string_view option_name, boost::any const& value = {});
bool OptionIsRequired(std::string_view option_name) const;

[[nodiscard]] std::unordered_set<std::string_view> GetNeededOptions() const;

1 change: 0 additions & 1 deletion src/core/algorithms/fd/fd_verifier/dynamic_fd_verifier.cpp
@@ -8,7 +8,6 @@

#include "config/equal_nulls/option.h"
#include "config/indices/option.h"
#include "config/indices/validate_index.h"
#include "config/names_and_descriptions.h"
#include "config/option_using.h"
#include "config/tabular_data/crud_operations/operations.h"
1 change: 0 additions & 1 deletion src/core/algorithms/fd/fd_verifier/fd_verifier.cpp
@@ -6,7 +6,6 @@

#include "config/equal_nulls/option.h"
#include "config/indices/option.h"
#include "config/indices/validate_index.h"
#include "config/names_and_descriptions.h"
#include "config/option_using.h"
#include "config/tabular_data/input_table/option.h"
4 changes: 4 additions & 0 deletions src/core/algorithms/ind/mind/mind.cpp
@@ -49,6 +49,10 @@ bool Mind::SetExternalOption(std::string_view option_name, boost::any const& val
return false;
}

bool Mind::ExternalOptionIsRequired(std::string_view option_name) const {
return auind_algo_->OptionIsRequired(option_name);
}
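The delegation pattern used here, where a pipeline answers "is this option required?" by asking the algorithm it wraps, can be sketched in Python. This is a simplified model rather than the project's implementation: `ToyAlgorithm`, `ToyPipeline`, and the option names are hypothetical. It follows the order shown in `Algorithm::OptionIsRequired`: the external check runs first, then the algorithm's own options, and unknown names are treated as not required.

```python
class ToyAlgorithm:
    def __init__(self, required_by_name):
        # name -> bool; stands in for the possible_options_ map
        self._required = dict(required_by_name)

    def external_option_is_required(self, name):
        # Base behaviour: no external options, so nothing is required.
        return False

    def option_is_required(self, name):
        if self.external_option_is_required(name):
            return True
        return self._required.get(name, False)


class ToyPipeline(ToyAlgorithm):
    """Wraps an inner algorithm and forwards the 'required' question to it."""

    def __init__(self, required_by_name, inner):
        super().__init__(required_by_name)
        self._inner = inner

    def external_option_is_required(self, name):
        return self._inner.option_is_required(name)


# Hypothetical option names, chosen only for illustration.
inner = ToyAlgorithm({"error": True})
pipeline = ToyPipeline({"max_arity": False}, inner)
print(pipeline.option_is_required("error"),
      pipeline.option_is_required("max_arity"),
      pipeline.option_is_required("unknown"))
```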

void Mind::LoadINDAlgorithmDataInternal() {
timings_.load = util::TimedInvoke(&Algorithm::LoadData, auind_algo_);
}
1 change: 1 addition & 0 deletions src/core/algorithms/ind/mind/mind.h
@@ -42,6 +42,7 @@ class Mind final : public INDAlgorithm {
void AddSpecificNeededOptions(
std::unordered_set<std::string_view>& previous_options) const override;
bool SetExternalOption(std::string_view option_name, boost::any const& value) override;
bool ExternalOptionIsRequired(std::string_view option_name) const override;
void LoadINDAlgorithmDataInternal() override;

bool TestCandidate(RawIND const& raw_ind);
12 changes: 6 additions & 6 deletions src/core/algorithms/nd/nd.cpp
@@ -2,8 +2,8 @@

#include <algorithm>

#include "algorithms/nd/nd_verifier/util/vector_to_string.h"
#include "algorithms/nd/util/get_vertical_names.h"
#include "util/range_to_string.h"

namespace model {

@@ -16,14 +16,14 @@ namespace model {
}

std::string ND::ToShortString() const {
using namespace algos::nd_verifier::util;
return VectorToString(GetLhsIndices()) + " -> " + VectorToString(GetRhsIndices());
using namespace util;
return RangeToString(GetLhsIndices()) + " -> " + RangeToString(GetRhsIndices());
}

std::string ND::ToLongString() const {
using namespace algos::nd_verifier::util;
return VectorToString(GetLhsNames()) + " -" + std::to_string(GetWeight()) + "-> " +
VectorToString(GetRhsNames());
using namespace util;
return RangeToString(GetLhsNames()) + " -" + std::to_string(GetWeight()) + "-> " +
RangeToString(GetRhsNames());
}

[[nodiscard]] std::tuple<std::vector<std::string>, std::vector<std::string>, WeightType>
6 changes: 3 additions & 3 deletions src/core/algorithms/nd/nd_verifier/nd_verifier.cpp
@@ -11,7 +11,6 @@

#include "algorithms/nd/nd_verifier/util/stats_calculator.h"
#include "algorithms/nd/nd_verifier/util/value_combination.h"
#include "algorithms/nd/nd_verifier/util/vector_to_string.h"
#include "config/descriptions.h"
#include "config/equal_nulls/option.h"
#include "config/indices/option.h"
@@ -23,6 +22,7 @@
#include "model/table/typed_column_data.h"
#include "model/types/builtin.h"
#include "model/types/type.h"
#include "util/range_to_string.h"
#include "util/timed_invoke.h"

namespace algos::nd_verifier {
@@ -62,8 +62,8 @@ unsigned long long NDVerifier::ExecuteInternal() {
LOG(INFO) << "Parameters of NDVerifier:";
LOG(INFO) << "\tInput table: " << input_table_->GetRelationName();
LOG(INFO) << "\tNull equals null: " << is_null_equal_null_;
LOG(INFO) << "\tLhs indices: " << util::VectorToString(lhs_indices_);
LOG(INFO) << "\tRhs indices: " << util::VectorToString(rhs_indices_);
LOG(INFO) << "\tLhs indices: " << ::util::RangeToString(lhs_indices_);
LOG(INFO) << "\tRhs indices: " << ::util::RangeToString(rhs_indices_);
LOG(INFO) << "\tWeight: " << weight_;

auto verification_time = ::util::TimedInvoke(&NDVerifier::VerifyND, this);
6 changes: 3 additions & 3 deletions src/core/algorithms/nd/nd_verifier/util/highlight.cpp
@@ -8,7 +8,7 @@
#include <vector>

#include "algorithms/nd/nd_verifier/util/value_combination.h"
#include "algorithms/nd/nd_verifier/util/vector_to_string.h"
#include "util/range_to_string.h"

namespace algos::nd_verifier::util {

@@ -100,14 +100,14 @@ std::vector<std::string> const& Highlight::GetRhsValues() {
}

[[nodiscard]] std::string Highlight::ToIndicesString() const {
return util::VectorToString(CalculateOccurencesIndices());
return ::util::RangeToString(CalculateOccurencesIndices());
}

[[nodiscard]] std::string Highlight::ToValuesString() const {
std::string const& lhs = GetLhsValue();
std::vector<std::string> const& rhs = CalculateRhsValues();

return lhs + " -> " + util::VectorToString(rhs);
return lhs + " -> " + ::util::RangeToString(rhs);
}

std::ostream& operator<<(std::ostream& os, Highlight const& hl) {
28 changes: 0 additions & 28 deletions src/core/algorithms/nd/nd_verifier/util/vector_to_string.h

This file was deleted.
