v1.4.1 #492

Merged
flarco merged 25 commits into main from v1.4.1
Feb 1, 2025

@flarco (Collaborator) commented Jan 23, 2025

Major Changes

  1. Pipeline Implementation

    • Added new Pipeline type for executing sequential steps
    • Introduced pipeline configuration loading from YAML/JSON files
    • Added support for pipeline steps with various types (log, replication, command, etc.)
  2. State Management Refactoring

    • Renamed Hooks to State in runtime state management
    • Created RuntimeState interface
    • Added separate ReplicationState and PipelineState implementations
  3. Chunking Functionality

    • Added support for chunking data processing based on column ranges
    • Implemented ProcessChunks() method for handling data partitioning
    • Added chunk size configuration options
  4. Case Handling Improvements

    • Added support for lower/upper case schema and table names
    • Improved temp table name handling for different database types
    • Added new format variables: stream_schema_lower/upper, stream_table_lower/upper
  5. File System Updates

    • Enhanced file pattern matching across different file system implementations
    • Improved file deletion handling for SFTP
    • Better handling of file paths in copy operations

Minor Changes

  • Added binary data handling for DuckDB
  • Improved UTF-8 validation in string casting
  • Added environment variable controls for DuckDB compute
  • Various bug fixes and code cleanup

flarco added 18 commits January 23, 2025 17:32
- conditionally load extensions based on environment variable DUCKDB_USE_INSTALLED_EXTENSIONS
- improves flexibility and avoids redundant installation when extensions are already present
- added SLING_DUCKDB_COMPUTE environment variable to allow disabling DuckDB computation for tasks. This makes testing and debugging easier in scenarios where DuckDB is not desired or not available.
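The two environment variables above can be read with a small helper. This is a minimal sketch, not the code from this PR: the helper name `envBool`, the defaults, and the use of `strconv.ParseBool` for the accepted spellings are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envBool reads a boolean flag through lookup (os.LookupEnv in real use).
// Unset or unparsable values fall back to def. Accepting whatever
// strconv.ParseBool accepts is an assumption of this sketch.
func envBool(lookup func(string) (string, bool), name string, def bool) bool {
	raw, ok := lookup(name)
	if !ok {
		return def
	}
	val, err := strconv.ParseBool(raw)
	if err != nil {
		return def
	}
	return val
}

func main() {
	// The defaults here are assumptions: compute on, installed extensions off.
	useCompute := envBool(os.LookupEnv, "SLING_DUCKDB_COMPUTE", true)
	useInstalled := envBool(os.LookupEnv, "DUCKDB_USE_INSTALLED_EXTENSIONS", false)
	fmt.Println("duckdb compute enabled:", useCompute)
	fmt.Println("use installed extensions:", useInstalled)
}
```

Passing the lookup function in, rather than calling `os.LookupEnv` directly, keeps the helper easy to exercise in tests without touching the process environment.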
- added pipeline configuration file support
- implemented pipeline execution logic
- added pipeline tests
- updated CLI to support pipeline configuration
- updated documentation to reflect pipeline functionality
- added pipeline state management
- added support for pipeline steps
- changed the top-level key from `pipeline` to `steps` in the YAML configuration for pipeline definitions.
- updated the `LoadPipelineConfig` function to correctly parse the `steps` key instead of `pipeline`.
- this ensures compatibility with the updated YAML structure and prevents errors during pipeline loading.
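To illustrate the `steps` key, here is a rough sketch of loading a pipeline definition. It uses JSON and the standard library so it runs without extra dependencies. The struct names, the `type` and `message` fields, and `loadPipeline` are hypothetical and are not the actual `LoadPipelineConfig` implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pipelineStep is a hypothetical, trimmed-down step; the field names are assumptions.
type pipelineStep struct {
	Type    string `json:"type"`
	Message string `json:"message,omitempty"`
}

// pipelineConfig reads its entries from the top-level "steps" key.
type pipelineConfig struct {
	Steps []pipelineStep `json:"steps"`
}

// loadPipeline parses the document and rejects configs with nothing under "steps",
// which is also what happens when a file still uses the old "pipeline" key.
func loadPipeline(raw []byte) (pipelineConfig, error) {
	var cfg pipelineConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return cfg, fmt.Errorf("parse pipeline: %w", err)
	}
	if len(cfg.Steps) == 0 {
		return cfg, fmt.Errorf("pipeline has no entries under the steps key")
	}
	return cfg, nil
}

func main() {
	raw := []byte(`{"steps": [{"type": "log", "message": "starting"}, {"type": "command"}]}`)
	cfg, err := loadPipeline(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Steps are visited in file order, matching sequential execution.
	for i, s := range cfg.Steps {
		fmt.Printf("step %d: %s\n", i+1, s.Type)
	}
}
```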
- Refactor runtime state handling to use a consistent `state` map instead of `hooks` map.
- Update pipeline and replication configurations to utilize the new `state` map.
- Modify `SetStateData` and `SetStateKeyValue` functions in `RuntimeState` interface.
- Adjust test YAML files to reflect the state changes.
- Enhance hook execution to incorporate the new state management.
- Updated the List method across all file system clients (Azure, FTP, Google, Local, S3, SFTP) to support glob pattern matching.
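As a rough illustration of glob filtering on a listing, the sketch below uses the standard library's `path.Match` as a stand-in. The matcher the file system clients actually use may differ, and `filterGlob` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"path"
)

// filterGlob keeps the paths that match pattern. path.Match is a stand-in
// for whatever matcher the clients use; with it, '*' does not cross '/'.
func filterGlob(pattern string, paths []string) ([]string, error) {
	var matched []string
	for _, p := range paths {
		ok, err := path.Match(pattern, p)
		if err != nil {
			return nil, err // malformed pattern
		}
		if ok {
			matched = append(matched, p)
		}
	}
	return matched, nil
}

func main() {
	listing := []string{"data/a.csv", "data/b.json", "data/sub/c.csv"}
	matched, err := filterGlob("data/*.csv", listing)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(matched)
}
```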
- Fixed a bug in the `CopyRecursive` function where the destination path was incorrectly constructed, leading to potential issues when the `toPath` already ended with a `/`.  The fix uses `strings.TrimSuffix` to remove any trailing `/` from `toPath` before appending the relative path.
- Fixed a bug where relative paths were not correctly calculated during recursive file copy, leading to incorrect destination paths.
- Improved handling of both single files and nested files.  The new logic correctly determines the relative path in all cases.
- Added checks to ensure correct destination path construction when `toPath` is a directory.
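The path handling described in the two fixes above can be sketched as follows. The `strings.TrimSuffix` call on `toPath` comes from the commit message. The helper names, the single-file fallback to the base name, and the leading-slash trim are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// relativePath returns filePath relative to fromPath. When fromPath is the
// file itself (single-file copy), it falls back to the file's base name.
// That fallback is an assumption, not necessarily what CopyRecursive does.
func relativePath(fromPath, filePath string) string {
	prefix := strings.TrimSuffix(fromPath, "/") + "/"
	if strings.HasPrefix(filePath, prefix) && len(filePath) > len(prefix) {
		return filePath[len(prefix):]
	}
	return path.Base(filePath)
}

// destPath joins toPath and relPath with exactly one '/' between them,
// trimming any trailing '/' from toPath first.
func destPath(toPath, relPath string) string {
	return strings.TrimSuffix(toPath, "/") + "/" + strings.TrimPrefix(relPath, "/")
}

func main() {
	rel := relativePath("src/", "src/nested/b.csv")
	fmt.Println(destPath("out/", rel))
	fmt.Println(destPath("out", relativePath("src/a.csv", "src/a.csv")))
}
```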
- Updated `ExecuteOnDone` in `Hook` and `Step` interfaces to return an `OnFailType` and an error.
- Modified `Hooks.Execute` and `Pipeline.Execute` to handle the returned `OnFailType` and error appropriately.  This improves error handling and allows for more granular control over failure scenarios.
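One way a caller might act on the returned `OnFailType` and error is sketched below. Only the name `OnFailType` and the two-value return come from the commit message. The `abort` and `warn` values, the `step` struct, and `runSteps` are hypothetical and only illustrate the control flow.

```go
package main

import (
	"errors"
	"fmt"
)

// OnFailType mirrors the name in the commit message; these values are hypothetical.
type OnFailType string

const (
	onFailAbort OnFailType = "abort"
	onFailWarn  OnFailType = "warn"
)

// step is a hypothetical stand-in for a pipeline step or hook.
type step struct {
	name   string
	onFail OnFailType
	run    func() error
}

// executeOnDone reports how the caller should react to the step's result.
func (s step) executeOnDone(runErr error) (OnFailType, error) {
	if runErr == nil {
		return "", nil
	}
	return s.onFail, fmt.Errorf("step %q failed: %w", s.name, runErr)
}

// runSteps runs steps in order; a "warn" failure is logged and skipped,
// any other failure stops the run.
func runSteps(steps []step) ([]string, error) {
	var completed []string
	for _, s := range steps {
		onFail, err := s.executeOnDone(s.run())
		if err != nil {
			if onFail == onFailWarn {
				fmt.Println("warning:", err)
				continue
			}
			return completed, err
		}
		completed = append(completed, s.name)
	}
	return completed, nil
}

func ok() error { return nil }

// failing builds a run function that always returns an error.
func failing(msg string) func() error {
	return func() error { return errors.New(msg) }
}

func main() {
	done, err := runSteps([]step{
		{name: "extract", onFail: onFailAbort, run: ok},
		{name: "notify", onFail: onFailWarn, run: failing("endpoint down")},
		{name: "load", onFail: onFailAbort, run: ok},
	})
	fmt.Println(done, err)
}
```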
- Added lower and upper case versions of schema and table names to the StreamState struct.
- Updated StateSet function to populate these new fields with the corresponding values from the format map.
- This enhancement provides more flexibility and consistency in handling schema and table names.
- The lower and upper case versions can be used for case-insensitive comparisons or other operations where case sensitivity is not required.
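A small sketch of how the four case variables could be derived from a stream's schema and table names. The variable names come from the PR description. The helper `caseVariables` and the plain map are assumptions and do not reflect the actual `StreamState` struct.

```go
package main

import (
	"fmt"
	"strings"
)

// caseVariables builds the lower and upper case format variables for one stream.
// The map shape is illustrative only.
func caseVariables(schema, table string) map[string]string {
	return map[string]string{
		"stream_schema_lower": strings.ToLower(schema),
		"stream_schema_upper": strings.ToUpper(schema),
		"stream_table_lower":  strings.ToLower(table),
		"stream_table_upper":  strings.ToUpper(table),
	}
}

func main() {
	vars := caseVariables("Sales", "Orders")
	// Example: build a lower-cased target object name from the variables.
	fmt.Println(vars["stream_schema_lower"] + "." + vars["stream_table_lower"])
}
```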
- Added specific handling for casting boolean strings ('true'/'false') to integers (1/0) in SQL Server queries.  This addresses potential data type mismatch issues when selecting boolean columns into integer columns.
- Correctly handle cases where a string column is cast to an integer, including non-boolean values.  The previous implementation only handled 'true' and 'false' strings, causing incorrect results for other string values.  This change adds an `else {col}` clause to handle these cases.
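The cast described in the two messages above might render to SQL roughly like this. The exact expression in the SQL Server template is not shown in this PR page, so the `case` text and the helper `boolStringToIntExpr` are assumptions. Only the mapping of 'true'/'false' to 1/0 and the `else {col}` fallback come from the commit messages.

```go
package main

import "fmt"

// boolStringToIntExpr renders a hypothetical SQL expression that maps
// 'true'/'false' to 1/0 and passes any other value through unchanged.
// col must already be a safely quoted column identifier.
func boolStringToIntExpr(col string) string {
	return fmt.Sprintf(
		"case when %[1]s = 'true' then 1 when %[1]s = 'false' then 0 else %[1]s end",
		col,
	)
}

func main() {
	fmt.Println(boolStringToIntExpr("[is_active]"))
}
```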
- added `stream_table_lower` and `stream_table_upper` variables to support specifying a range of tables for replication
- added support for chunking large datasets using the `chunk_size` option in the replication configuration
- implemented `ChunkByColumnRange` function to generate chunk ranges based on the specified column and size
- updated replication process to handle chunked streams
- added tests for chunking functionality
- improved error handling and logging
- updated documentation to reflect the new chunking feature
- added new config option `parallel_chunks` to control how many chunks to run in parallel.
- adjusted test data and scripts to accommodate chunking
- optimized chunking process to reduce memory usage and improve performance
- updated test cases for improved coverage and accuracy
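The chunking idea above, splitting a column's value range into fixed-size pieces and running a bounded number of them at once, can be sketched as below. This is not the `ChunkByColumnRange` implementation: `chunkRange`, `chunkRanges`, and `processChunks` are hypothetical names, the ranges are inclusive integer bounds, and the `parallel` argument only plays the role that `parallel_chunks` has in the config.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// chunkRange is an inclusive [Start, End] range over an integer column.
type chunkRange struct{ Start, End int64 }

// chunkRanges splits [min, max] into consecutive ranges of at most size values.
func chunkRanges(min, max, size int64) []chunkRange {
	if size <= 0 || max < min {
		return nil
	}
	var out []chunkRange
	for start := min; start <= max; start += size {
		end := start + size - 1
		if end > max {
			end = max
		}
		out = append(out, chunkRange{start, end})
	}
	return out
}

// processChunks runs fn for every chunk, with at most parallel calls in flight.
func processChunks(chunks []chunkRange, parallel int, fn func(chunkRange)) {
	if parallel < 1 {
		parallel = 1
	}
	sem := make(chan struct{}, parallel) // counting semaphore
	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		sem <- struct{}{} // blocks while parallel chunks are already running
		go func(c chunkRange) {
			defer wg.Done()
			defer func() { <-sem }()
			fn(c)
		}(c)
	}
	wg.Wait()
}

func main() {
	chunks := chunkRanges(1, 10, 4)
	fmt.Println(chunks)

	var rows int64
	processChunks(chunks, 2, func(c chunkRange) {
		// A real worker would run a query such as: where id between Start and End.
		atomic.AddInt64(&rows, c.End-c.Start+1)
	})
	fmt.Println("values covered:", rows)
}
```

Bounding concurrency with a buffered channel keeps the number of simultaneous source queries fixed regardless of how many chunks the range produces.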
- disable default pool behavior to avoid unnecessary connection buildup
- improve connection management for source/target options
- TODO: refactor metadata passing for better connection management
- Updated the expected output for the `sling run` command in the test suite.
- Fixed a discrepancy in the expected output string for test cases 69 and 70.
- Improved test case accuracy and reliability.
- Added a new test for sling pipeline 02, which includes sftp and aws s3 data transfers.
- Added a new test for sling pipeline 01 to improve test coverage.
@flarco flarco merged commit 4914669 into main Feb 1, 2025
6 of 8 checks passed
@flarco flarco deleted the v1.4.1 branch February 1, 2025 21:27