Adds the Finite-State Transducer algorithm (#11242)
This PR adds a parallel _Finite-State Transducer_ (FST) algorithm. The FST is a key component of the nested JSON parser.

# Background


**An example of a Finite-State Transducer (FST), i.e., the algorithm we are trying to mimic:**
[Slides from the JSON parser presentation, Slides 11-17](https://docs.google.com/presentation/d/1NTQdUMM44NzzHxLNnvcGLQk6pI-fdoM3cXqNqushMbU/edit?usp=sharing)

## Our GPU-based implementation
**The GPU-based algorithm builds on the following work:**
[ParPaRaw: Massively Parallel Parsing of Delimiter-Separated Raw Data](https://arxiv.org/pdf/1905.13415.pdf)

**The following sections are of relevance:**
- Section 3.1
- Section 4.5 (i.e., the Multi-fragment in-register array)

**How the algorithm works is illustrated in the following presentation:**
[ParPaRaw @VLDB'20](https://eliasstehle.com/media/parparaw_vldb_2020.pdf#page=21)

## Relevant Data Structures
**A word about the motivation and need for the _Multi-fragment in-register array_:**

The composition of two state-transition vectors is a key operation (in the prefix scan). Basically, this is what it does for two state-transition vectors `lhs` and `rhs`, both comprising `N` items:
```
// result[i] is the state reached from state i after applying lhs, then rhs
for (int32_t i = 0; i < N; ++i) {
  result[i] = rhs[lhs[i]];
}
return result;
```
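
To make the role of this composition in the prefix scan more concrete, here is a minimal host-side sketch. It assumes a hypothetical two-state toy DFA (state 0 is outside a quoted string, state 1 is inside); the names `StateVector`, `transition_vector`, `compose`, and `states_after_each_symbol` are illustrative and are not taken from this PR. The sequential loop stands in for the parallel prefix scan, which is applicable because the composition is associative.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical toy DFA: state 0 = outside a quoted string, state 1 = inside.
constexpr std::int32_t N = 2;
using StateVector = std::array<std::int32_t, N>;

// State-transition vector of a single symbol: entry s holds the state the
// DFA reaches when it reads `c` while in state s.
StateVector transition_vector(char c)
{
  if (c == '"') { return {1, 0}; }  // a quote toggles between the two states
  return {0, 1};                    // any other symbol keeps the current state
}

// Composition of two state-transition vectors: apply lhs first, then rhs.
StateVector compose(StateVector const& lhs, StateVector const& rhs)
{
  StateVector result{};
  for (std::int32_t i = 0; i < N; ++i) {
    result[i] = rhs[lhs[i]];
  }
  return result;
}

// Sequential inclusive scan over the per-symbol vectors. Looking up the
// seed state in each prefix yields the DFA state after every symbol.
std::vector<std::int32_t> states_after_each_symbol(std::string const& input,
                                                   std::int32_t seed_state)
{
  assert(seed_state >= 0 && seed_state < N);
  std::vector<std::int32_t> states;
  StateVector prefix{0, 1};  // identity vector: every state maps to itself
  for (char c : input) {
    prefix = compose(prefix, transition_vector(c));
    states.push_back(prefix[seed_state]);
  }
  return states;
}
```

Because each prefix covers all possible starting states, the scan does not need to know the actual starting state in advance; the seed state is only used for the final lookup.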


The relevant part is the indexing into `rhs`: `rhs[lhs[i]]`, i.e., the index is `lhs[i]`, a runtime value that isn't known at compile time. It's important to understand that in CUB's prefix scan both `rhs` and `lhs` are thread-local variables. As such, they either live in the fast register file or in (slow off-chip) local memory.
The register file has one shortcoming: it cannot be indexed dynamically. Here, however, we index into `rhs` dynamically, so `rhs` would have to be spilled to local memory (backed by device memory) to allow for dynamic indexing. This would usually make the algorithm very slow. That's why we have the _Multi-fragment in-register array_. For its implementation details, I'd suggest reading [Section 4.5](https://arxiv.org/pdf/1905.13415.pdf).
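
The core trick can be sketched as follows, under simplifying assumptions: when the items are small, several of them can be packed into the bits of a single 32-bit word, so that a dynamic index turns into a shift and a mask on a value that can stay in a register. `PackedArray` and `compose_packed` are hypothetical names for this illustration and are not the class from `in_reg_array.cuh`; as its name suggests, the actual multi-fragment structure additionally splits items into fragments spread over several registers (see Section 4.5), which this single-word sketch omits.

```cpp
#include <cassert>
#include <cstdint>

// Simplified single-word illustration (hypothetical, not the PR's class):
// stores NUM_ITEMS items of BITS_PER_ITEM bits each in one 32-bit word.
template <std::uint32_t NUM_ITEMS, std::uint32_t BITS_PER_ITEM>
class PackedArray {
  static_assert(BITS_PER_ITEM >= 1 && BITS_PER_ITEM < 32, "unsupported item width");
  static_assert(NUM_ITEMS * BITS_PER_ITEM <= 32, "items must fit into one 32-bit word");

 public:
  // Dynamic read: a shift and a mask instead of an indexed memory access.
  std::uint32_t get(std::uint32_t index) const
  {
    assert(index < NUM_ITEMS);
    return (data_ >> (index * BITS_PER_ITEM)) & MASK;
  }

  // Dynamic write: clear the item's bit-field, then OR in the new value.
  void set(std::uint32_t index, std::uint32_t value)
  {
    assert(index < NUM_ITEMS && value <= MASK);
    std::uint32_t const shift = index * BITS_PER_ITEM;
    data_ = (data_ & ~(MASK << shift)) | (value << shift);
  }

 private:
  static constexpr std::uint32_t MASK = (1u << BITS_PER_ITEM) - 1u;
  std::uint32_t data_ = 0;
};

// The composition from above, expressed on packed arrays: the dynamic
// index lhs.get(i) no longer requires an array that lives in memory.
template <std::uint32_t NUM_ITEMS, std::uint32_t BITS_PER_ITEM>
PackedArray<NUM_ITEMS, BITS_PER_ITEM> compose_packed(
  PackedArray<NUM_ITEMS, BITS_PER_ITEM> const& lhs,
  PackedArray<NUM_ITEMS, BITS_PER_ITEM> const& rhs)
{
  PackedArray<NUM_ITEMS, BITS_PER_ITEM> result;
  for (std::uint32_t i = 0; i < NUM_ITEMS; ++i) {
    result.set(i, rhs.get(lhs.get(i)));
  }
  return result;
}
```

For example, `PackedArray<8, 4>` holds a state-transition vector of eight states in exactly one 32-bit word.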

In contrast, the following example is fine: if `N` is known at compile time and sufficiently small (at most tens of items), the loop can be fully unrolled, every index becomes a compile-time constant, and `foo` will be mapped to registers.
```
// this is fine, if N is a compile-time constant
for (int32_t i = 1; i < N; ++i) {
  foo[i] = foo[i-1];
}
```

# Style & CUB Integration

The following files may be integrated into CUB at a later point, hence their deviation in style from cuDF.

- `in_reg_array.cuh`
- `agent_dfa.cuh`
- `device_dfa.cuh`
- `dispatch_dfa.cuh`

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Tobias Ribizel (https://github.com/upsj)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #11242
elstehle authored Jul 22, 2022
1 parent a541ffb commit ebcea0f
Showing 9 changed files with 2,206 additions and 4 deletions.
9 changes: 9 additions & 0 deletions cpp/include/cudf_test/cudf_gtest.hpp
@@ -176,3 +176,12 @@ struct TypeList<Types<TYPES...>> {
} catch (std::exception & e) { \
FAIL() << "statement:" << #statement << std::endl << "reason: " << e.what() << std::endl; \
}

/**
* @brief test macro comparing for equality of \p lhs and \p rhs for the first \p size elements.
*/
#define CUDF_TEST_EXPECT_VECTOR_EQUAL(lhs, rhs, size) \
do { \
for (decltype(size) i = 0; i < size; i++) \
EXPECT_EQ(lhs[i], rhs[i]) << "Mismatch at index #" << i; \
} while (0)
672 changes: 672 additions & 0 deletions cpp/src/io/fst/agent_dfa.cuh

Large diffs are not rendered by default.

94 changes: 94 additions & 0 deletions cpp/src/io/fst/device_dfa.cuh
@@ -0,0 +1,94 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

#include "dispatch_dfa.cuh"

#include <io/utilities/hostdevice_vector.hpp>

#include <cstdint>

namespace cudf::io::fst {

/**
* @brief Uses a deterministic finite automaton to transduce a sequence of symbols from an input
* iterator to a sequence of transduced output symbols.
*
* @tparam DfaT The DFA specification
* @tparam SymbolItT Random-access input iterator type to symbols fed into the FST
* @tparam TransducedOutItT Random-access output iterator to which the transduced output will be
* written
* @tparam TransducedIndexOutItT Random-access output iterator type to which the input symbols'
* indexes are written.
* @tparam TransducedCountOutItT A single-item output iterator type to which the total number of
* output symbols is written
* @tparam OffsetT A type large enough to index into either of both: (a) the input symbols and (b)
* the output symbols
* @param[in] d_temp_storage Device-accessible allocation of temporary storage. When NULL, the
* required allocation size is written to \p temp_storage_bytes and no work is done.
* @param[in,out] temp_storage_bytes Reference to size in bytes of \p d_temp_storage allocation
* @param[in] dfa The DFA specifying the number of distinct symbol groups, transition table, and
* translation table
* @param[in] d_chars_in Random-access input iterator to the beginning of the sequence of input
* symbols
* @param[in] num_chars The total number of input symbols to process
* @param[out] transduced_out_it Random-access output iterator to which the transduced output is
* written
* @param[out] transduced_out_idx_it Random-access output iterator to which, the index i is written
* iff the i-th input symbol caused some output to be written
* @param[out] d_num_transduced_out_it A single-item output iterator type to which the total number
* of output symbols is written
* @param[in] seed_state The DFA's starting state. For streaming DFAs this corresponds to the
* "end-state" of the previous invocation of the algorithm.
* @param[in] stream CUDA stream to launch kernels within. Default is the null-stream.
*/
template <typename DfaT,
typename SymbolItT,
typename TransducedOutItT,
typename TransducedIndexOutItT,
typename TransducedCountOutItT,
typename OffsetT>
cudaError_t DeviceTransduce(void* d_temp_storage,
size_t& temp_storage_bytes,
DfaT dfa,
SymbolItT d_chars_in,
OffsetT num_chars,
TransducedOutItT transduced_out_it,
TransducedIndexOutItT transduced_out_idx_it,
TransducedCountOutItT d_num_transduced_out_it,
uint32_t seed_state = 0,
cudaStream_t stream = 0)
{
using DispatchDfaT = detail::DispatchFSM<DfaT,
SymbolItT,
TransducedOutItT,
TransducedIndexOutItT,
TransducedCountOutItT,
OffsetT>;

return DispatchDfaT::Dispatch(d_temp_storage,
temp_storage_bytes,
dfa,
seed_state,
d_chars_in,
num_chars,
transduced_out_it,
transduced_out_idx_it,
d_num_transduced_out_it,
stream);
}

} // namespace cudf::io::fst
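
The temporary-storage contract documented for `DeviceTransduce` (a null `d_temp_storage` turns the call into a pure size query) follows the two-phase calling pattern known from CUB. The sketch below illustrates only that calling pattern on the host. `mock_transduce` is a hypothetical stand-in that merely drops spaces; it is not the cuDF implementation, takes no DFA, and uses plain pointers instead of iterators. `run_two_phase` and `TransduceResult` are likewise illustrative names.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-only stand-in that mimics the documented contract:
// with a null temp_storage it only reports the required size.
int mock_transduce(void* temp_storage,
                   std::size_t& temp_storage_bytes,
                   char const* chars_in,
                   std::size_t num_chars,
                   char* transduced_out,
                   std::size_t* transduced_out_idx,
                   std::size_t* num_transduced_out)
{
  // Request at least one byte so that a valid allocation is never null.
  std::size_t const required = num_chars > 0 ? num_chars : 1;
  if (temp_storage == nullptr) {
    temp_storage_bytes = required;  // size query: no work is done
    return 0;
  }
  if (temp_storage_bytes < required) { return 1; }  // allocation too small

  // Pass 1: flag the symbols that produce output (scratch in temp storage).
  auto* keep = static_cast<std::uint8_t*>(temp_storage);
  for (std::size_t i = 0; i < num_chars; ++i) {
    keep[i] = (chars_in[i] != ' ') ? 1 : 0;
  }
  // Pass 2: write each output symbol together with its input index.
  std::size_t count = 0;
  for (std::size_t i = 0; i < num_chars; ++i) {
    if (keep[i] != 0) {
      transduced_out[count]     = chars_in[i];
      transduced_out_idx[count] = i;
      ++count;
    }
  }
  *num_transduced_out = count;
  return 0;
}

struct TransduceResult {
  std::string symbols;
  std::vector<std::size_t> indexes;
};

TransduceResult run_two_phase(std::string const& input)
{
  std::vector<char> out(input.size());
  std::vector<std::size_t> idx(input.size());
  std::size_t count              = 0;
  std::size_t temp_storage_bytes = 0;

  // Phase 1: query the required temporary storage size.
  mock_transduce(nullptr, temp_storage_bytes, input.data(), input.size(),
                 out.data(), idx.data(), &count);
  // Phase 2: allocate the storage and run the algorithm.
  std::vector<std::uint8_t> temp_storage(temp_storage_bytes);
  mock_transduce(temp_storage.data(), temp_storage_bytes, input.data(), input.size(),
                 out.data(), idx.data(), &count);

  idx.resize(count);
  return {std::string(out.data(), count), idx};
}
```

The benefit of this pattern is that the caller owns the allocation and can reuse it across invocations, for instance when streaming input chunk by chunk with the `seed_state` of each call set to the end state of the previous one.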