v1.4.1 #492

flarco · 2025-01-23T20:38:25Z

Major Changes

Pipeline Implementation
- Added new Pipeline type for executing sequential steps
- Introduced pipeline configuration loading from YAML/JSON files
- Added support for pipeline steps with various types (log, replication, command, etc.)

State Management Refactoring
- Renamed Hooks to State in runtime state management
- Created RuntimeState interface
- Added separate ReplicationState and PipelineState implementations

Chunking Functionality
- Added support for chunking data processing based on column ranges
- Implemented ProcessChunks() method for handling data partitioning
- Added chunk size configuration options

Case Handling Improvements
- Added support for lower/upper case schema and table names
- Improved temp table name handling for different database types
- Added new format variables: stream_schema_lower/upper, stream_table_lower/upper

File System Updates
- Enhanced file pattern matching across different file system implementations
- Improved file deletion handling for SFTP
- Better handling of file paths in copy operations

Minor Changes

- Added binary data handling for DuckDB
- Improved UTF-8 validation in string casting
- Added environment variable controls for DuckDB compute
- Various bug fixes and code cleanup
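
For reviewers, here is a minimal sketch of what UTF-8 validation before string output could look like. The helper name `sanitizeUTF8` and the choice to substitute U+FFFD are assumptions made for illustration; the PR does not state how invalid bytes are actually handled.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 returns s unchanged when it is valid UTF-8. Otherwise each run
// of invalid bytes is replaced with the Unicode replacement character.
// Hypothetical helper; the replacement policy is an assumption.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	// "\xff" is not a valid UTF-8 byte sequence.
	fmt.Printf("%q\n", sanitizeUTF8("a\xffb"))
	fmt.Printf("%q\n", sanitizeUTF8("already valid"))
}
```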

- conditionally load extensions based on the environment variable `DUCKDB_USE_INSTALLED_EXTENSIONS`
- improves flexibility and avoids redundant installation when extensions are already present

- added the `SLING_DUCKDB_COMPUTE` environment variable to allow disabling DuckDB computation for tasks. This makes testing and debugging easier in scenarios where DuckDB is not desired or not available.
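
A rough sketch of how such a toggle might be read. The set of values treated as "disabled" and the default of "enabled when unset" are assumptions; the PR only states that the variable can disable DuckDB compute. `duckDBComputeEnabled` is a hypothetical name.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// duckDBComputeEnabled interprets the raw value of SLING_DUCKDB_COMPUTE.
// Assumption: common "falsy" strings disable compute, anything else
// (including an unset variable) leaves it enabled.
func duckDBComputeEnabled(value string) bool {
	switch strings.ToLower(strings.TrimSpace(value)) {
	case "false", "0", "no", "off":
		return false
	}
	return true
}

func main() {
	enabled := duckDBComputeEnabled(os.Getenv("SLING_DUCKDB_COMPUTE"))
	fmt.Println("duckdb compute enabled:", enabled)
}
```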

- added pipeline configuration file support
- implemented pipeline execution logic
- added pipeline tests
- updated CLI to support pipeline configuration
- updated documentation to reflect pipeline functionality
- added pipeline state management
- added support for pipeline steps

- changed the top-level key from `pipeline` to `steps` in the YAML configuration for pipeline definitions.
- updated the `LoadPipelineConfig` function to correctly parse the `steps` key instead of `pipeline`.
- this ensures compatibility with the updated YAML structure and prevents errors during pipeline loading.
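
To illustrate the key change, here is a small sketch that parses a pipeline definition with a top-level `steps` key. It uses JSON and the standard library to stay self-contained. `PipelineConfig`, `loadPipeline`, and the `type`/`message` step fields are assumptions for illustration, not the actual `LoadPipelineConfig` code.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// PipelineConfig is a simplified stand-in for the real configuration type.
// Only the top-level "steps" key is taken from the PR description.
type PipelineConfig struct {
	Steps []map[string]any `json:"steps"`
}

// loadPipeline parses the config and rejects documents without any steps,
// which is what a file still using the old "pipeline" key would look like.
func loadPipeline(data []byte) (PipelineConfig, error) {
	var cfg PipelineConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, err
	}
	if len(cfg.Steps) == 0 {
		return cfg, errors.New("pipeline config has no steps")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadPipeline([]byte(`{"steps": [{"type": "log", "message": "hello"}]}`))
	fmt.Println(len(cfg.Steps), err)

	_, err = loadPipeline([]byte(`{"pipeline": [{"type": "log"}]}`))
	fmt.Println(err)
}
```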

- Refactor runtime state handling to use a consistent `state` map instead of the `hooks` map.
- Update pipeline and replication configurations to utilize the new `state` map.
- Modify `SetStateData` and `SetStateKeyValue` functions in the `RuntimeState` interface.
- Adjust test YAML files to reflect the state changes.
- Enhance hook execution to incorporate the new state management.

- Updated the List method across all file system clients (Azure, FTP, Google, Local, S3, SFTP) to support glob pattern matching.
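
As a sketch of the idea, the snippet below filters a listing with a glob pattern using the standard library's `path.Match`. Whether the file system clients use `path.Match` or another glob library is not stated in this PR, so treat `filterGlob` as a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
)

// filterGlob keeps the paths that match the glob pattern.
// Note that with path.Match, "*" does not match across "/" separators.
func filterGlob(pattern string, paths []string) ([]string, error) {
	var out []string
	for _, p := range paths {
		ok, err := path.Match(pattern, p)
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			out = append(out, p)
		}
	}
	return out, nil
}

func main() {
	listing := []string{"data/a.csv", "data/b.json", "data/sub/c.csv"}
	matched, err := filterGlob("data/*.csv", listing)
	fmt.Println(matched, err)
}
```

Only `data/a.csv` matches here, because `*` stops at the directory separator.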

- Fixed a bug in the `CopyRecursive` function where the destination path was incorrectly constructed, leading to potential issues when the `toPath` already ended with a `/`. The fix uses `strings.TrimSuffix` to remove any trailing `/` from `toPath` before appending the relative path.

- Fixed a bug where relative paths were not correctly calculated during recursive file copy, leading to incorrect destination paths.
- Improved handling of both single files and nested files. The new logic correctly determines the relative path in all cases.
- Added checks to ensure correct destination path construction when `toPath` is a directory.
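
The trailing-slash fix can be sketched as follows. `destPath` is a hypothetical helper written for illustration and is not the actual `CopyRecursive` code; only the use of `strings.TrimSuffix` on `toPath` comes from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// destPath joins a destination root and a relative path with exactly one
// separator, whether or not toPath already ends with "/".
func destPath(toPath, relPath string) string {
	return strings.TrimSuffix(toPath, "/") + "/" + strings.TrimPrefix(relPath, "/")
}

func main() {
	// Both calls produce the same destination.
	fmt.Println(destPath("s3://bucket/out/", "a/b.csv"))
	fmt.Println(destPath("s3://bucket/out", "a/b.csv"))
}
```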

- Updated `ExecuteOnDone` in `Hook` and `Step` interfaces to return an `OnFailType` and an error.
- Modified `Hooks.Execute` and `Pipeline.Execute` to handle the returned `OnFailType` and error appropriately. This improves error handling and allows for more granular control over failure scenarios.
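
A simplified sketch of the control flow this enables. The `OnFailType` values (`abort`, `warn`), the `Step` interface shape, `fakeStep`, and `runSteps` are all assumptions for illustration; the real interfaces in the codebase may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// OnFailType says what the caller should do when a step fails.
// The two values below are hypothetical.
type OnFailType string

const (
	OnFailAbort OnFailType = "abort" // stop the run and return the error
	OnFailWarn  OnFailType = "warn"  // report the error and keep going
)

// Step is a simplified stand-in for the Hook and Step interfaces.
type Step interface {
	Execute() error
	ExecuteOnDone(err error) (OnFailType, error)
}

// fakeStep fails on demand and reports a fixed failure policy.
type fakeStep struct {
	fail   bool
	onFail OnFailType
}

func (s fakeStep) Execute() error {
	if s.fail {
		return errors.New("step failed")
	}
	return nil
}

func (s fakeStep) ExecuteOnDone(err error) (OnFailType, error) {
	return s.onFail, err
}

// runSteps executes steps in order. On failure, only OnFailAbort stops the
// run; any other policy is reported as a warning and execution continues.
func runSteps(steps []Step) error {
	for _, s := range steps {
		onFail, err := s.ExecuteOnDone(s.Execute())
		if err == nil {
			continue
		}
		if onFail == OnFailAbort {
			return err
		}
		fmt.Println("warning:", err)
	}
	return nil
}

func main() {
	steps := []Step{
		fakeStep{fail: true, onFail: OnFailWarn},
		fakeStep{fail: true, onFail: OnFailAbort},
	}
	fmt.Println("result:", runSteps(steps))
}
```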

- Added lower and upper case versions of schema and table names to the StreamState struct.
- Updated StateSet function to populate these new fields with the corresponding values from the format map.
- This enhancement provides more flexibility and consistency in handling schema and table names.
- The lower and upper case versions can be used for case-insensitive comparisons or other operations where case sensitivity is not required.
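
A minimal sketch of how the four case variables could be derived. The variable names come from this PR; `caseVariants` is a hypothetical helper and does not reflect the actual `StateSet` implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// caseVariants builds the lower/upper case format variables for a stream's
// schema and table names.
func caseVariants(schema, table string) map[string]string {
	return map[string]string{
		"stream_schema_lower": strings.ToLower(schema),
		"stream_schema_upper": strings.ToUpper(schema),
		"stream_table_lower":  strings.ToLower(table),
		"stream_table_upper":  strings.ToUpper(table),
	}
}

func main() {
	vars := caseVariants("Public", "Orders")
	fmt.Println(vars["stream_schema_lower"], vars["stream_table_upper"])
}
```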

- Added specific handling for casting boolean strings ('true'/'false') to integers (1/0) in SQL Server queries. This addresses potential data type mismatch issues when selecting boolean columns into integer columns.

- Correctly handle cases where a string column is cast to an integer, including non-boolean values. The previous implementation only handled 'true' and 'false' strings, causing incorrect results for other string values. This change adds an `else {col}` clause to handle these cases.
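
The two casting commits above can be sketched as a generated `CASE` expression. The exact SQL template used in the codebase is not shown in this PR, so `boolStringToIntExpr` and the expression layout are assumptions; only the `'true'`/`'false'` mapping and the `else {col}` fallback come from the descriptions.

```go
package main

import "fmt"

// boolStringToIntExpr returns a SQL expression that maps 'true'/'false' to
// 1/0 and falls back to the column itself for any other value.
func boolStringToIntExpr(col string) string {
	return fmt.Sprintf(
		"case when %[1]s = 'true' then 1 when %[1]s = 'false' then 0 else %[1]s end",
		col,
	)
}

func main() {
	fmt.Println(boolStringToIntExpr("[is_active]"))
}
```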

- added `stream_table_lower` and `stream_table_upper` variables, which hold the lower-case and upper-case forms of the stream table name

- added support for chunking large datasets using the `chunk_size` option in the replication configuration
- implemented `ChunkByColumnRange` function to generate chunk ranges based on the specified column and size
- updated replication process to handle chunked streams
- added tests for chunking functionality
- improved error handling and logging
- updated documentation to reflect the new chunking feature
- added new config option `parallel_chunks` to control how many chunks to run in parallel
- adjusted test data and scripts to accommodate chunking
- optimized chunking process to reduce memory usage and improve performance
- updated test cases for improved coverage and accuracy
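
To make the range-based chunking concrete, here is a small sketch that splits a numeric column range into fixed-size, inclusive chunks. `chunkRange` and `chunkRanges` are hypothetical names; this is not the actual `ChunkByColumnRange` code, which may also handle other column types.

```go
package main

import "fmt"

// chunkRange is an inclusive [Start, End] range over a numeric column.
type chunkRange struct {
	Start, End int64
}

// chunkRanges splits [lo, hi] into consecutive ranges of at most size values.
// It returns nil for an empty range or a non-positive size.
func chunkRanges(lo, hi, size int64) []chunkRange {
	if size <= 0 || hi < lo {
		return nil
	}
	var out []chunkRange
	for start := lo; start <= hi; start += size {
		end := start + size - 1
		if end > hi {
			end = hi // the last chunk may be shorter
		}
		out = append(out, chunkRange{Start: start, End: end})
	}
	return out
}

func main() {
	// Each range could become a WHERE clause for one chunked stream.
	for _, r := range chunkRanges(1, 10, 4) {
		fmt.Printf("id >= %d and id <= %d\n", r.Start, r.End)
	}
}
```

With bounds 1 to 10 and a size of 4, this yields the ranges 1-4, 5-8, and 9-10.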

- disable default pool behavior to avoid unnecessary connection buildup
- improve connection management for source/target options
- TODO: refactor metadata passing for better connection management

- Updated the expected output for the `sling run` command in the test suite.
- Fixed a discrepancy in the expected output string for test cases 69 and 70.
- Improved test case accuracy and reliability.

- Added a new test for sling pipeline 02, which includes SFTP and AWS S3 data transfers.
- Added a new test for sling pipeline 01 to improve test coverage.

flarco added 18 commits January 23, 2025 17:32

improve delete function for FTP and SFTP

18d6b1f

handle pre-installed extensions

e414700

add env var to disable duckdb compute

ccfc811

remove unnecessary error wrapping in task_run_write.go

2486f5d

add command example to pipeline

4376b33

implement glob pattern matching in List method

2353bbf

add timestamp type mapping for Azure DWH, Azure SQL, and SQL Server

ca5acb3

handle invalid UTF-8 characters in CSV output

16895a4

handle boolean to integer casting in sqlserver

abb1169

handle boolean to integer cast in sqlserver

c69e765

pvanderlinden reviewed Feb 1, 2025

core/sling/config.go

flarco added 7 commits February 1, 2025 13:28

add stream table range variables

f28095a

disable default pool behavior

e18771a

update sling test suite

91dac05

update p.02.yaml

7feef27

add two new sling pipeline tests

e1380ff

update sling pipeline test descriptions

1543166

flarco merged commit 4914669 into main Feb 1, 2025
6 of 8 checks passed

flarco deleted the v1.4.1 branch February 1, 2025 21:27
