[FEA] AST filtering in parquet reader (#13348)
This PR is a step in the plan to support AST-based filter predicate pushdown in the Parquet reader: it adds predicate pushdown at the row-group level.

The statistics of each column in each row group are loaded into a device column, and an AST filter is applied to the min and max of each column to select the row groups to read. To do this, the user-provided AST is converted into a second AST that operates on the per-column min and max values (the 'Statistics AST'). After the selected row groups are read, the user-provided AST is applied to the output columns to filter out any remaining non-matching rows.
A new `column_name_reference` expression lets users build ASTs that reference columns by name, since they may not know the column indices before reading. Because the AST engine accepts only column index references, the user-provided AST is transformed first. Two new AST transformation classes are introduced:
1. `named_to_reference_converter` - Converts column name references to column index references
2. `stats_expression_converter` - Converts the output-table filtering AST produced above to the 'Statistics AST'.

Note: `column_name_reference` is supported only for predicate pushdown filtering; it is not supported for other AST operations such as transform or joins.
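
The two-stage filtering described above can be illustrated without libcudf. The following is a minimal host-side sketch, assuming a single integer column and a predicate of the form `col > value`. `RowGroupStats`, `may_match_greater`, `select_row_groups`, and `filter_rows` are hypothetical names used only for illustration; they are not part of this PR. The sketch shows the predicate rewritten against the `max` statistic to select row groups, and the original predicate then applied to the rows that were read.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the min/max statistics of one column in one row group.
struct RowGroupStats {
  std::int64_t min;
  std::int64_t max;
};

// Statistics form of `col > value`: a row group can contain a matching row
// only if its maximum is greater than `value`.
inline bool may_match_greater(RowGroupStats const& s, std::int64_t value)
{
  return s.max > value;
}

// Stage 1: keep the indices of row groups that the statistics cannot rule out.
inline std::vector<std::size_t> select_row_groups(std::vector<RowGroupStats> const& stats,
                                                  std::int64_t value)
{
  std::vector<std::size_t> selected;
  for (std::size_t i = 0; i < stats.size(); ++i) {
    assert(stats[i].min <= stats[i].max);  // sanity check on the statistics
    if (may_match_greater(stats[i], value)) { selected.push_back(i); }
  }
  return selected;
}

// Stage 2: apply the original predicate `col > value` to the rows that were read.
inline std::vector<std::int64_t> filter_rows(std::vector<std::int64_t> const& rows,
                                             std::int64_t value)
{
  std::vector<std::int64_t> out;
  for (auto v : rows) {
    if (v > value) { out.push_back(v); }
  }
  return out;
}
```

Stage 1 is conservative: a selected row group may still contain non-matching rows, which is why stage 2 is needed.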

- [x] #13472 
- [x] Convert column chunk min, max to cudf type column.
- [x] Add AST filter interface to parquet reader options
- [x] Convert AST to Statistics AST
- [x] Apply statistics AST on Stats values to get row_groups
- [x] Apply AST as filter on output columns.

Depends on #13472

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #13348
karthikeyann authored Jul 26, 2023
1 parent 2231b15 commit fa09cca
Showing 21 changed files with 1,635 additions and 69 deletions.
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -114,6 +114,7 @@ outputs:
- test -f $PREFIX/lib/libcudf_identify_stream_usage_mode_testing.so
- test -f $PREFIX/include/cudf/aggregation.hpp
- test -f $PREFIX/include/cudf/ast/detail/expression_parser.hpp
- test -f $PREFIX/include/cudf/ast/detail/expression_transformer.hpp
- test -f $PREFIX/include/cudf/ast/detail/operators.hpp
- test -f $PREFIX/include/cudf/ast/expressions.hpp
- test -f $PREFIX/include/cudf/binaryop.hpp
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -395,6 +395,7 @@ add_library(
src/io/parquet/page_enc.cu
src/io/parquet/page_hdr.cu
src/io/parquet/page_string_decode.cu
src/io/parquet/predicate_pushdown.cpp
src/io/parquet/reader.cpp
src/io/parquet/reader_impl.cpp
src/io/parquet/reader_impl_helpers.cpp
27 changes: 8 additions & 19 deletions cpp/include/cudf/ast/detail/expression_parser.hpp
@@ -15,12 +15,12 @@
*/
#pragma once

#include <cudf/ast/detail/operators.hpp>
#include <cudf/ast/expressions.hpp>
#include <cudf/scalar/scalar_device_view.cuh>
#include <cudf/table/table_view.hpp>
#include <cudf/types.hpp>

#include <thrust/optional.h>
#include <thrust/scan.h>

#include <functional>
@@ -72,24 +72,6 @@ struct alignas(8) device_data_reference {
}
};

// Type trait for wrapping nullable types in a thrust::optional. Non-nullable
// types are returned as is.
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
using type = thrust::optional<T>;
};

template <typename T>
struct possibly_null_value<T, false> {
using type = T;
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

// Type used for intermediate storage in expression evaluation.
template <bool has_nulls>
using IntermediateDataType = possibly_null_value_t<std::int64_t, has_nulls>;
@@ -193,6 +175,13 @@ class expression_parser {
*/
cudf::size_type visit(operation const& expr);

/**
* @brief Visit a column name reference expression.
*
* @param expr Column name reference expression.
* @return cudf::size_type Index of device data reference for the expression.
*/
cudf::size_type visit(column_name_reference const& expr);
/**
* @brief Internal class used to track the utilization of intermediate storage locations.
*
64 changes: 64 additions & 0 deletions cpp/include/cudf/ast/detail/expression_transformer.hpp
@@ -0,0 +1,64 @@

/*
* Copyright (c) 2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include <cudf/ast/expressions.hpp>

namespace cudf::ast::detail {
/**
* @brief Base "visitor" pattern class with the `expression` class for expression transformer.
*
* This class can be used to implement recursive traversal of AST tree, and used to validate or
* translate an AST expression.
*/
class expression_transformer {
public:
/**
* @brief Visit a literal expression.
*
* @param expr Literal expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(literal const& expr) = 0;

/**
* @brief Visit a column reference expression.
*
* @param expr Column reference expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(column_reference const& expr) = 0;

/**
* @brief Visit an operation expression.
*
* @param expr Operation expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(operation const& expr) = 0;

/**
* @brief Visit a column name reference expression.
*
* @param expr Column name reference expression
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> visit(column_name_reference const& expr) = 0;

virtual ~expression_transformer() {}
};
} // namespace cudf::ast::detail
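
The double dispatch between `accept` and `visit` can be seen in a stripped-down form. This is a minimal sketch, assuming a two-node hierarchy; `expr`, `index_ref`, `name_ref`, `transformer`, and `name_to_index` are hypothetical stand-ins, not the libcudf classes. It shows one way a transformer could replace name references with index references while owning the nodes it creates, so that the returned `std::reference_wrapper` stays valid for as long as the transformer lives.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

struct transformer;  // forward declarations
struct index_ref;
struct name_ref;

// Hypothetical miniature of the expression base class.
struct expr {
  virtual std::reference_wrapper<expr const> accept(transformer& v) const = 0;
  virtual ~expr() = default;
};

// Visitor interface: one overload per concrete node type.
struct transformer {
  virtual std::reference_wrapper<expr const> visit(index_ref const& e) = 0;
  virtual std::reference_wrapper<expr const> visit(name_ref const& e)  = 0;
  virtual ~transformer() = default;
};

struct index_ref : expr {
  explicit index_ref(int i) : index(i) {}
  std::reference_wrapper<expr const> accept(transformer& v) const override
  {
    return v.visit(*this);  // dispatches on the static type of *this
  }
  int index;
};

struct name_ref : expr {
  explicit name_ref(std::string n) : name(std::move(n)) {}
  std::reference_wrapper<expr const> accept(transformer& v) const override
  {
    return v.visit(*this);
  }
  std::string name;
};

// Replaces name references with index references and owns the new nodes.
struct name_to_index : transformer {
  explicit name_to_index(std::vector<std::string> names) : column_names(std::move(names)) {}

  // Index references need no translation and are returned unchanged.
  std::reference_wrapper<expr const> visit(index_ref const& e) override { return e; }

  std::reference_wrapper<expr const> visit(name_ref const& e) override
  {
    for (std::size_t i = 0; i < column_names.size(); ++i) {
      if (column_names[i] == e.name) {
        owned.push_back(std::make_unique<index_ref>(static_cast<int>(i)));
        return *owned.back();
      }
    }
    throw std::invalid_argument("unknown column name: " + e.name);
  }

  std::vector<std::string> column_names;
  std::vector<std::unique_ptr<expr>> owned;  // keeps returned references alive
};
```

Calling `accept` on a `name_ref("b")` with a `name_to_index` built from the names `a`, `b`, `c` yields a reference to an `index_ref` with index 1. The references are only valid while the transformer that owns the nodes is alive.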
20 changes: 20 additions & 0 deletions cpp/include/cudf/ast/detail/operators.hpp
@@ -20,6 +20,8 @@
#include <cudf/utilities/error.hpp>
#include <cudf/utilities/type_dispatcher.hpp>

#include <thrust/optional.h>

#include <cuda/std/type_traits>

#include <cmath>
@@ -33,6 +35,24 @@ namespace ast {

namespace detail {

// Type trait for wrapping nullable types in a thrust::optional. Non-nullable
// types are returned as is.
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
using type = thrust::optional<T>;
};

template <typename T>
struct possibly_null_value<T, false> {
using type = T;
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

// Traits for valid operator / type combinations
template <typename Op, typename LHS, typename RHS>
constexpr bool is_valid_binary_op = cuda::std::is_invocable_v<Op, LHS, RHS>;
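
The effect of the `possibly_null_value` trait moved into this header can be checked on the host. This is a minimal sketch, assuming C++17 and substituting `std::optional` for `thrust::optional`; the `add` function is a hypothetical consumer and does not exist in libcudf. It shows the trait selecting an optional type only when `has_nulls` is true, so the non-null path carries no extra state.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <type_traits>

// Same shape as the trait above, with std::optional swapped in for thrust::optional.
template <typename T, bool has_nulls>
struct possibly_null_value;

template <typename T>
struct possibly_null_value<T, true> {
  using type = std::optional<T>;
};

template <typename T>
struct possibly_null_value<T, false> {
  using type = T;
};

template <typename T, bool has_nulls>
using possibly_null_value_t = typename possibly_null_value<T, has_nulls>::type;

static_assert(std::is_same_v<possibly_null_value_t<std::int64_t, true>,
                             std::optional<std::int64_t>>);
static_assert(std::is_same_v<possibly_null_value_t<std::int64_t, false>, std::int64_t>);

// Hypothetical consumer: adds two values and propagates nulls only on the nullable path.
// `has_nulls` cannot be deduced from the arguments, so it must be given explicitly.
template <bool has_nulls>
possibly_null_value_t<std::int64_t, has_nulls> add(
  possibly_null_value_t<std::int64_t, has_nulls> a,
  possibly_null_value_t<std::int64_t, has_nulls> b)
{
  if constexpr (has_nulls) {
    if (!a.has_value() || !b.has_value()) { return std::nullopt; }
    return *a + *b;
  } else {
    return a + b;
  }
}
```

With this sketch, `add<false>(2, 3)` works on plain integers, while `add<true>(2, std::nullopt)` returns an empty optional.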
72 changes: 71 additions & 1 deletion cpp/include/cudf/ast/expressions.hpp
@@ -29,7 +29,8 @@ namespace ast {
// Forward declaration.
namespace detail {
class expression_parser;
}
class expression_transformer;
} // namespace detail

/**
* @brief A generic expression that can be evaluated to return a value.
@@ -46,6 +47,15 @@ struct expression {
*/
virtual cudf::size_type accept(detail::expression_parser& visitor) const = 0;

/**
* @brief Accepts a visitor class.
*
* @param visitor The `expression_transformer` transforming this expression tree
* @return Reference wrapper of transformed expression
*/
virtual std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const = 0;

/**
* @brief Returns true if the expression may evaluate to null.
*
@@ -305,6 +315,12 @@ class literal : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -398,6 +414,12 @@ class column_reference : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -458,6 +480,12 @@ class operation : public expression {
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
@@ -474,6 +502,48 @@ class operation : public expression {
std::vector<std::reference_wrapper<expression const>> const operands;
};

/**
* @brief An expression referring to data from a column in a table by name.
*/
class column_name_reference : public expression {
public:
/**
* @brief Construct a new column name reference object
*
* @param column_name Name of this column in the table metadata (provided when the expression is
* evaluated).
*/
column_name_reference(std::string column_name) : column_name(std::move(column_name)) {}

/**
* @brief Get the column name.
*
* @return The name of this column reference
*/
[[nodiscard]] std::string get_column_name() const { return column_name; }

/**
* @copydoc expression::accept
*/
cudf::size_type accept(detail::expression_parser& visitor) const override;

/**
* @copydoc expression::accept
*/
std::reference_wrapper<expression const> accept(
detail::expression_transformer& visitor) const override;

[[nodiscard]] bool may_evaluate_null(table_view const& left,
table_view const& right,
rmm::cuda_stream_view stream) const override
{
return true;
}

private:
std::string column_name;
};

} // namespace ast

} // namespace cudf
4 changes: 2 additions & 2 deletions cpp/include/cudf/detail/transform.hpp
@@ -41,8 +41,8 @@ std::unique_ptr<column> transform(column_view const& input,
*
* @param stream CUDA stream used for device memory operations and kernel launches.
*/
std::unique_ptr<column> compute_column(table_view const table,
ast::operation const& expr,
std::unique_ptr<column> compute_column(table_view const& table,
ast::expression const& expr,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

30 changes: 30 additions & 0 deletions cpp/include/cudf/io/parquet.hpp
@@ -16,6 +16,7 @@

#pragma once

#include <cudf/ast/expressions.hpp>
#include <cudf/io/detail/parquet.hpp>
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>
@@ -62,6 +63,9 @@ class parquet_reader_options {
// Number of rows to read; `nullopt` is all
std::optional<size_type> _num_rows;

// Predicate filter as AST to filter output rows.
std::optional<std::reference_wrapper<ast::expression const>> _filter;

// Whether to store string data as categorical type
bool _convert_strings_to_categories = false;
// Whether to use PANDAS metadata to load columns
@@ -160,6 +164,13 @@ class parquet_reader_options {
*/
[[nodiscard]] auto const& get_row_groups() const { return _row_groups; }

/**
* @brief Returns AST based filter for predicate pushdown.
*
* @return AST expression to use as filter
*/
[[nodiscard]] auto const& get_filter() const { return _filter; }

/**
* @brief Returns timestamp type used to cast timestamp columns.
*
@@ -181,6 +192,13 @@ class parquet_reader_options {
*/
void set_row_groups(std::vector<std::vector<size_type>> row_groups);

/**
* @brief Sets AST based filter for predicate pushdown.
*
* @param filter AST expression to use as filter
*/
void set_filter(ast::expression const& filter) { _filter = filter; }

/**
* @brief Sets to enable/disable conversion of strings to categories.
*
@@ -273,6 +291,18 @@ class parquet_reader_options_builder {
return *this;
}

/**
* @brief Sets AST based filter for predicate pushdown.
*
* @param filter AST expression to use as filter
* @return this for chaining
*/
parquet_reader_options_builder& filter(ast::expression const& filter)
{
options.set_filter(filter);
return *this;
}

/**
* @brief Sets enable/disable conversion of strings to categories.
*
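
Because `_filter` is declared as `std::optional<std::reference_wrapper<ast::expression const>>`, the options object refers to the caller's expression rather than copying it. The following is a minimal sketch of that storage pattern, assuming a trivial `expression` struct and a `reader_options` class that are hypothetical stand-ins for the libcudf types. It shows that the stored filter aliases the original object, so the expression must outlive the options that reference it.

```cpp
#include <cassert>
#include <functional>
#include <optional>

// Hypothetical stand-in for ast::expression.
struct expression {
  int id;
};

// Hypothetical stand-in for parquet_reader_options, reduced to the filter member.
class reader_options {
  // Empty when no filter was set; otherwise a non-owning reference.
  std::optional<std::reference_wrapper<expression const>> _filter;

 public:
  [[nodiscard]] auto const& get_filter() const { return _filter; }

  // Stores a reference only: the caller keeps ownership of `filter`.
  void set_filter(expression const& filter) { _filter = filter; }
};
```

A consequence of this design is that passing a temporary expression to `set_filter` would leave a dangling reference once the temporary is destroyed, so callers should keep the expression in a variable with a longer lifetime than the options object.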
12 changes: 11 additions & 1 deletion cpp/src/ast/expression_parser.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2021, NVIDIA CORPORATION.
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -193,6 +193,16 @@ cudf::size_type expression_parser::visit(operation const& expr)
return index;
}

// TODO: Eliminate column name references from expression_parser because
// 2 code paths diverge in supporting column name references:
// 1. column name references are specific to cuIO
// 2. column name references are not supported in the libcudf table operations such as join,
// transform.
cudf::size_type expression_parser::visit(column_name_reference const& expr)
{
CUDF_FAIL("Column name references are not supported in the AST expression parser.");
}

cudf::data_type expression_parser::output_type() const
{
return _data_references.empty() ? cudf::data_type(cudf::type_id::EMPTY)