feat: Adding pg_legacy_replication verified source using decoderbufs #589
@@ -0,0 +1,130 @@
# Postgres legacy replication
[Postgres](https://www.postgresql.org/) is one of the most popular relational database management systems. This verified source uses Postgres' replication functionality to efficiently process changes
in tables (a process often referred to as _Change Data Capture_ or CDC). It uses [logical decoding](https://www.postgresql.org/docs/current/logicaldecoding.html) and the `decoderbufs`
[output plugin](https://github.com/debezium/postgres-decoderbufs), a shared library that must be compiled and installed on self-hosted servers or enabled on managed Postgres offerings.

| Source             | Description                                     |
|--------------------|-------------------------------------------------|
| replication_source | Load published messages from a replication slot |
## Install decoderbufs

Instructions can be found [here](https://github.com/debezium/postgres-decoderbufs?tab=readme-ov-file#building).

Below is an example installation in a Docker image:
```Dockerfile
FROM postgres:14

# Install dependencies required to build decoderbufs
RUN apt-get update
RUN apt-get install -f -y \
    software-properties-common \
    build-essential \
    pkg-config \
    git

RUN apt-get install -f -y \
    postgresql-server-dev-14 \
    libprotobuf-c-dev && \
    rm -rf /var/lib/apt/lists/*

ARG decoderbufs_version=v1.7.0.Final
RUN git clone https://github.com/debezium/postgres-decoderbufs -b $decoderbufs_version --single-branch && \
    cd postgres-decoderbufs && \
    make && make install && \
    cd .. && \
    rm -rf postgres-decoderbufs
```
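
Building the shared library is only half of the setup; the server must also be configured for logical decoding. The snippet below is a sketch based on the decoderbufs/Debezium documentation (these are standard Postgres settings, though managed services expose them through their own configuration flags), and `decoderbufs_test` is just a throwaway slot name used to verify that the plugin loads:

```sql
-- In postgresql.conf (requires a server restart):
--   wal_level = logical
--   shared_preload_libraries = 'decoderbufs'

-- Verify that the plugin can be loaded by creating and dropping a test slot:
SELECT * FROM pg_create_logical_replication_slot('decoderbufs_test', 'decoderbufs');
SELECT pg_drop_replication_slot('decoderbufs_test');
```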

## Initialize the pipeline

```bash
$ dlt init pg_legacy_replication duckdb
```

This uses `duckdb` as the destination, but you can choose any of the supported [destinations](https://dlthub.com/docs/dlt-ecosystem/destinations/).

## Set up user

The Postgres user needs to have the `LOGIN` and `REPLICATION` attributes assigned:

```sql
CREATE ROLE replication_user WITH LOGIN REPLICATION;
```

It also needs various read-only privileges on the database (connect to the database first):

```sql
\connect dlt_data
GRANT USAGE ON SCHEMA public TO replication_user;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replication_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO replication_user;
```
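
To sanity-check the role before running the pipeline, you can inspect its attributes (purely optional; `pg_roles` is a standard system view):

```sql
SELECT rolname, rolcanlogin, rolreplication
FROM pg_roles
WHERE rolname = 'replication_user';
-- Both rolcanlogin and rolreplication should be true.
```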

## Add credentials
1. Open `.dlt/secrets.toml`.
2. Enter your Postgres credentials:

   ```toml
   [sources.pg_legacy_replication]
   credentials="postgresql://replication_user:<<password>>@localhost:5432/dlt_data"
   ```
3. Enter credentials for your chosen destination as per the [docs](https://dlthub.com/docs/dlt-ecosystem/destinations/).
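
If you prefer not to embed the password in a URL, the source credentials from step 2 can typically also be supplied as individual fields (a sketch using dlt's standard `ConnectionStringCredentials` keys; the values are examples):

```toml
[sources.pg_legacy_replication.credentials]
drivername = "postgresql"
database = "dlt_data"
username = "replication_user"
password = "<<password>>"
host = "localhost"
port = 5432
```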

## Run the pipeline

1. Install the necessary dependencies by running the following command:

   ```bash
   pip install -r requirements.txt
   ```

1. Now the pipeline can be run by using the command:

   ```bash
   python pg_legacy_replication_pipeline.py
   ```

1. To make sure that everything is loaded as expected, use the command:

   ```bash
   dlt pipeline pg_replication_pipeline show
   ```

# Differences between `pg_legacy_replication` and `pg_replication`

## Overview

`pg_legacy_replication` is a fork of the verified `pg_replication` source. The primary goal of this fork is to provide logical replication capabilities for Postgres instances running versions
earlier than 10, when the `pgoutput` plugin was not yet available. This fork draws inspiration from the original `pg_replication` source and the `decoderbufs` library,
which is actively maintained by Debezium.

## Key Differences from `pg_replication`

### Replication User Ownership Requirements
One of the limitations of native Postgres replication is that the replication user must **own** the tables in order to add them to a **publication**.
Additionally, once a table is added to a publication, it cannot be removed; this requires creating a new replication slot, which results in the loss of any state tracking.

### Limitations in `pg_replication`
The current `pg_replication` implementation has several limitations:
- It supports only a single initial snapshot of the data.
- It requires `CREATE` access to the source database in order to perform the initial snapshot.
- **Superuser** access is required to replicate entire Postgres schemas.
  While the `pg_legacy_replication` source theoretically reads the entire WAL across all schemas, the current implementation using dlt transformers restricts this functionality.
  In practice, this has not been a common use case.
- The implementation is opinionated in its approach to data transfer. Specifically, when updates or deletes are required, it defaults to a `merge` write disposition,
  which replicates live data without tracking changes over time.

### Features of `pg_legacy_replication`

This fork of `pg_replication` addresses the aforementioned limitations and introduces the following improvements:
- Adheres to the dlt philosophy by treating the WAL as an upstream resource. This replication stream is then transformed into various dlt resources, with customizable options for write disposition,
  file formats, type hints, etc., specified at the resource level rather than at the source level (see the usage sketch below).
- Supports an initial snapshot of all tables using the replication slot's exported snapshot. Additionally, ad-hoc snapshots can be performed using the serializable deferrable isolation level,
  similar to `pg_dump`.
- Emphasizes the use of `pyarrow` and parquet formats for efficient data storage and transfer. A dedicated backend has been implemented to support these formats.
- Replication messages are decoded using Protocol Buffers (protobufs) in C, rather than relying on native Python byte buffer parsing. This ensures greater efficiency and performance.
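
To illustrate the resource-level configuration described above, a pipeline script might wire things up roughly as follows. This is a sketch built from the arguments documented for `replication_source` and `init_replication` (shown further down in this PR); the slot name, table names, primary keys, and backend choice are placeholders:

```python
import dlt

from pg_legacy_replication import init_replication, replication_source

pipeline = dlt.pipeline(
    pipeline_name="pg_legacy_replication_pipeline",
    destination="duckdb",
    dataset_name="replicated_data",
)

# One-off: create the replication slot and take an initial snapshot of the tables.
# With take_snapshots=True this returns a dlt source holding the snapshot resources.
snapshots = init_replication(
    slot_name="example_slot",
    schema="public",
    table_names=("tbl_x", "tbl_y"),
    take_snapshots=True,
)
pipeline.run(snapshots)

# Recurring: read new WAL messages from the slot; one dlt resource per table.
changes = replication_source(
    slot_name="example_slot",
    schema="public",
    table_names=("tbl_x", "tbl_y"),
    repl_options={
        "tbl_x": {"backend": "pyarrow"},  # per-table ReplicationOptions
    },
)

# Write disposition and other hints are set per resource, not per source.
changes.tbl_x.apply_hints(write_disposition="merge", primary_key="id_x")
changes.tbl_y.apply_hints(write_disposition="append")

pipeline.run(changes)
```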

## Next steps
- Add support for the [wal2json](https://github.com/eulerto/wal2json) replication plugin. This is particularly important for environments such as **Amazon RDS**, which supports `wal2json`,
  as opposed to on-premises or Google Cloud SQL instances that support `decoderbufs`.

@@ -0,0 +1,140 @@
```python
"""Replicates postgres tables in batch using logical decoding."""

from typing import Any, Callable, Dict, Iterable, Mapping, Optional, Sequence, Union

import dlt
from dlt.extract import DltResource
from dlt.extract.items import TDataItem
from dlt.sources.credentials import ConnectionStringCredentials
from collections import defaultdict

from .helpers import (
    BackendHandler,
    ItemGenerator,
    ReplicationOptions,
    advance_slot,
    cleanup_snapshot_resources,
    get_max_lsn,
    init_replication,
)


@dlt.source
def replication_source(
    slot_name: str,
    schema: str,
    table_names: Union[str, Sequence[str]],
    credentials: ConnectionStringCredentials = dlt.secrets.value,
    repl_options: Optional[Mapping[str, ReplicationOptions]] = None,
    target_batch_size: int = 1000,
    flush_slot: bool = True,
) -> Iterable[DltResource]:
    """
    Defines a dlt source for replicating Postgres tables using logical replication.
    This source reads from a replication slot and pipes the changes using transformers.

    - Relies on a replication slot that publishes DML operations (i.e. `insert`, `update`, and `delete`).
    - Maintains the LSN of the last consumed message in state to track progress.
    - At the start of a run, advances the slot up to the last message consumed in the previous run (Postgres > 10 only).
    - Processes changes in batches to limit memory usage.

    Args:
        slot_name (str):
            The name of the logical replication slot used to fetch WAL changes.
        schema (str):
            Name of the schema to replicate tables from.
        table_names (Union[str, Sequence[str]]):
            The name(s) of the tables to replicate. Can be a single table name or a list of table names.
        credentials (ConnectionStringCredentials):
            Database credentials for connecting to the Postgres instance.
        repl_options (Optional[Mapping[str, ReplicationOptions]], optional):
            A mapping of table names to `ReplicationOptions`, allowing for fine-grained control over
            replication behavior for each table.

            Each `ReplicationOptions` dictionary can include the following keys:
            - `backend` (Optional[TableBackend]): Specifies the backend to use for table replication.
            - `backend_kwargs` (Optional[Mapping[str, Any]]): Additional configuration options for the backend.
            - `column_hints` (Optional[TTableSchemaColumns]): A dictionary of hints for column types or properties.
            - `include_lsn` (Optional[bool]): Whether to include the LSN (Log Sequence Number)
              in the replicated data. Defaults to `True`.
            - `include_deleted_ts` (Optional[bool]): Whether to include a timestamp for deleted rows.
              Defaults to `True`.
            - `include_commit_ts` (Optional[bool]): Whether to include the commit timestamp of each change.
            - `include_tx_id` (Optional[bool]): Whether to include the transaction ID of each change.
            - `included_columns` (Optional[Set[str]]): A set of specific columns to include in the replication.
              If not specified, all columns are included.
        target_batch_size (int, optional):
            The target size of each batch of replicated data items. Defaults to `1000`.
        flush_slot (bool, optional):
            If `True`, advances the replication slot to the last processed LSN
            to prevent replaying already replicated changes. Defaults to `True`.

    Yields:
        Iterable[DltResource]:
            A collection of `DltResource` objects, each corresponding to a table being replicated.

    Notes:
        - The `repl_options` parameter allows fine-tuning of replication behavior, such as column filtering
          or write disposition configuration, per table.
        - The replication process is incremental, ensuring only new changes are processed after the last commit LSN.
    """
    table_names = [table_names] if isinstance(table_names, str) else table_names or []
    repl_options = defaultdict(lambda: ReplicationOptions(), repl_options or {})

    @dlt.resource(name=lambda args: args["slot_name"], standalone=True)
    def replication_resource(slot_name: str) -> Iterable[TDataItem]:
        # start where we left off in previous run
        start_lsn = dlt.current.resource_state().get("last_commit_lsn", 0)
        if flush_slot:
            advance_slot(start_lsn, slot_name, credentials)

        # continue until last message in replication slot
        upto_lsn = get_max_lsn(credentials)
        if upto_lsn is None:
            return

        table_qnames = {f"{schema}.{table_name}" for table_name in table_names}

        # generate items in batches
        while True:
            gen = ItemGenerator(
                credentials=credentials,
                slot_name=slot_name,
                table_qnames=table_qnames,
                upto_lsn=upto_lsn,
                start_lsn=start_lsn,
                repl_options=repl_options,
                target_batch_size=target_batch_size,
            )
            yield from gen
            if gen.generated_all:
                dlt.current.resource_state()["last_commit_lsn"] = gen.last_commit_lsn
                break
            start_lsn = gen.last_commit_lsn

    wal_reader = replication_resource(slot_name)

    for table in table_names:
        yield dlt.transformer(
            _create_table_dispatch(table, repl_options=repl_options[table]),
            data_from=wal_reader,
            name=table,
        )


def _create_table_dispatch(
    table: str, repl_options: ReplicationOptions
) -> Callable[[TDataItem], Any]:
    """Creates a dispatch handler that processes data items based on a specified table and optional column hints."""
    handler = BackendHandler(table, repl_options)
    # FIXME Uhhh.. why do I have to do this?
    handler.__qualname__ = "BackendHandler.__call__"  # type: ignore[attr-defined]
```
**Reviewer comment:** yeah! why? is dlt checking this somewhere?

**Author reply:** Yes. I have no idea why this is required. Here is the output with the line commented out:

```
~/repos/github/neuromantik33/verified-sources $ poetry run pytest tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality
============================= test session starts ==============================
platform linux -- Python 3.11.11, pytest-7.4.2, pluggy-1.3.0 -- /home/drnick/repos/github/neuromantik33/verified-sources/.venv/bin/python
cachedir: .pytest_cache
rootdir: /home/drnick/repos/github/neuromantik33/verified-sources
configfile: pytest.ini
plugins: mock-3.12.0, requests-mock-1.11.0, cov-5.0.0, forked-1.6.0, anyio-4.0.0
collecting ... collected 2 items
tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[sqlalchemy-duckdb] FAILED [ 50%]
tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[pyarrow-duckdb] FAILED [100%]
=================================== FAILURES ===================================
__________________ test_core_functionality[sqlalchemy-duckdb] __________________
src_config = (<dlt.pipeline.pipeline.Pipeline object at 0x7f96cf51f3d0>, 'test_slot_a72aac18')
destination_name = 'duckdb', backend = 'sqlalchemy'
@pytest.mark.parametrize("destination_name", ALL_DESTINATIONS)
@pytest.mark.parametrize("backend", ["sqlalchemy", "pyarrow"])
def test_core_functionality(
src_config: Tuple[dlt.Pipeline, str], destination_name: str, backend: TableBackend
) -> None:
@dlt.resource(write_disposition="merge", primary_key="id_x")
def tbl_x(data):
yield data
@dlt.resource(write_disposition="merge", primary_key="id_y")
def tbl_y(data):
yield data
src_pl, slot_name = src_config
src_pl.run(
[
tbl_x({"id_x": 1, "val_x": "foo"}),
tbl_y({"id_y": 1, "val_y": True}),
]
)
add_pk(src_pl.sql_client, "tbl_x", "id_x")
add_pk(src_pl.sql_client, "tbl_y", "id_y")
snapshots = init_replication(
slot_name=slot_name,
schema=src_pl.dataset_name,
table_names=("tbl_x", "tbl_y"),
take_snapshots=True,
table_options={
"tbl_x": {"backend": backend},
"tbl_y": {"backend": backend},
},
)
> changes = replication_source(
slot_name=slot_name,
schema=src_pl.dataset_name,
table_names=("tbl_x", "tbl_y"),
repl_options={
"tbl_x": {"backend": backend},
"tbl_y": {"backend": backend},
},
)
backend = 'sqlalchemy'
destination_name = 'duckdb'
slot_name = 'test_slot_a72aac18'
snapshots = <dlt.extract.source.DltSource object at 0x7f96ceb9fd50>
src_config = (<dlt.pipeline.pipeline.Pipeline object at 0x7f96cf51f3d0>, 'test_slot_a72aac18')
src_pl = <dlt.pipeline.pipeline.Pipeline object at 0x7f96cf51f3d0>
tbl_x = <dlt.extract.resource.DltResource object at 0x7f96d8ba9d50>
tbl_y = <dlt.extract.resource.DltResource object at 0x7f96da95bdd0>
tests/pg_legacy_replication/test_pg_replication.py:65:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:195: in __call__
source = self._deco_f(*args, **kwargs)
args = ()
kwargs = {'repl_options': {'tbl_x': {'backend': 'sqlalchemy'}, 'tbl_y': {'backend': 'sqlalchemy'}}, 'schema': 'src_pl_dataset_202501221235482995', 'slot_name': 'test_slot_a72aac18', 'table_names': ('tbl_x', 'tbl_y')}
self = <dlt.extract.decorators.DltSourceFactoryWrapper object at 0x7f96d7768d90>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:280: in _wrap
return _eval_rv(rv, schema_copy)
_eval_rv = <function DltSourceFactoryWrapper._wrap.<locals>._eval_rv at 0x7f96d773fba0>
_make_schema = <function DltSourceFactoryWrapper._wrap.<locals>._make_schema at 0x7f96cf55c540>
args = ()
conf_f = <function replication_source at 0x7f96cf55c4a0>
kwargs = {'repl_options': {'tbl_x': {'backend': 'sqlalchemy'}, 'tbl_y': {'backend': 'sqlalchemy'}}, 'schema': 'src_pl_dataset_202501221235482995', 'slot_name': 'test_slot_a72aac18', 'table_names': ('tbl_x', 'tbl_y')}
pipeline_name = 'src_pl'
proxy = <dlt.common.pipeline.PipelineContext object at 0x7f96ceb05d10>
rv = <generator object replication_source at 0x7f96cd7182c0>
schema_copy = Schema replication_source at 140285669523216
source_sections = ('sources', 'pg_legacy_replication', 'replication_source')
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:240: in _eval_rv
_rv = list(_rv)
_rv = <generator object replication_source at 0x7f96cd7182c0>
schema_copy = Schema replication_source at 140285669523216
self = <dlt.extract.decorators.DltSourceFactoryWrapper object at 0x7f96d7768d90>
source_section = 'pg_legacy_replication'
sources/pg_legacy_replication/__init__.py:118: in replication_source
yield dlt.transformer(
credentials = <dlt.common.configuration.specs.connection_string_credentials.ConnectionStringCredentials object at 0x7f96cd7ffc50>
flush_slot = True
repl_options = defaultdict(<function replication_source.<locals>.<lambda> at 0x7f96cd740180>, {'tbl_x': {'backend': 'sqlalchemy'}, 'tbl_y': {'backend': 'sqlalchemy'}})
replication_resource = <function replication_source.<locals>.replication_resource at 0x7f96cd7ea700>
schema = 'src_pl_dataset_202501221235482995'
slot_name = 'test_slot_a72aac18'
table = 'tbl_x'
table_names = ('tbl_x', 'tbl_y')
target_batch_size = 1000
wal_reader = <dlt.extract.resource.DltResource object at 0x7f96ceba2050>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:958: in transformer
return resource( # type: ignore
_impl_cls = <class 'dlt.extract.resource.DltResource'>
columns = None
data_from = <dlt.extract.resource.DltResource object at 0x7f96ceba2050>
f = BackendHandler(table='tbl_x', repl_options={'backend': 'sqlalchemy'})
file_format = None
max_table_nesting = None
merge_key = None
name = 'tbl_x'
parallelized = False
primary_key = None
schema_contract = None
selected = True
spec = None
standalone = False
table_format = None
table_name = None
write_disposition = None
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:758: in resource
return decorator(data)
_impl_cls = <class 'dlt.extract.resource.DltResource'>
columns = None
data = BackendHandler(table='tbl_x', repl_options={'backend': 'sqlalchemy'})
data_from = <dlt.extract.resource.DltResource object at 0x7f96ceba2050>
decorator = <function resource.<locals>.decorator at 0x7f96cd7ea980>
file_format = None
make_resource = <function resource.<locals>.make_resource at 0x7f96cd7ea840>
max_table_nesting = None
merge_key = None
name = 'tbl_x'
parallelized = False
primary_key = None
references = None
schema_contract = None
selected = True
spec = None
standalone = False
table_format = None
table_name = None
wrap_standalone = <function resource.<locals>.wrap_standalone at 0x7f96cd7ea8e0>
write_disposition = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
f = BackendHandler(table='tbl_x', repl_options={'backend': 'sqlalchemy'})
def decorator(
f: Callable[TResourceFunParams, Any]
) -> Callable[TResourceFunParams, TDltResourceImpl]:
if not callable(f):
if data_from:
# raise more descriptive exception if we construct transformer
raise InvalidTransformerDataTypeGeneratorFunctionRequired(
name or "<no name>", f, type(f)
)
raise ResourceFunctionExpected(name or "<no name>", f, type(f))
if not standalone and callable(name):
raise DynamicNameNotStandaloneResource(get_callable_name(f))
resource_name = name if name and not callable(name) else get_callable_name(f)
func_module = inspect.getmodule(f)
source_section = _get_source_section_name(func_module)
is_inner_resource = is_inner_callable(f)
if spec is None:
# autodetect spec
SPEC, resolvable_fields = spec_from_signature(
f, inspect.signature(f), include_defaults=standalone
)
if is_inner_resource and not standalone:
if len(resolvable_fields) > 0:
# prevent required arguments to inner functions that are not standalone
raise ResourceInnerCallableConfigWrapDisallowed(resource_name, source_section)
else:
# empty spec for inner functions - they should not be injected
SPEC = BaseConfiguration
else:
SPEC = spec
# assign spec to "f"
set_fun_spec(f, SPEC)
# register non inner resources as source with single resource in it
if not is_inner_resource:
# a source function for the source wrapper, args that go to source are forwarded
# to a single resource within
def _source(
name_ovr: str, section_ovr: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> TDltResourceImpl:
return wrap_standalone(name_ovr or resource_name, section_ovr or source_section, f)(
*args, **kwargs
)
# make the source module same as original resource
> _source.__qualname__ = f.__qualname__
E AttributeError: 'BackendHandler' object has no attribute '__qualname__'
SPEC = <class 'sources.pg_legacy_replication.helpers.BackendHandlerConfiguration'>
_source = <function resource.<locals>.decorator.<locals>._source at 0x7f96cd7eaa20>
data_from = <dlt.extract.resource.DltResource object at 0x7f96ceba2050>
f = BackendHandler(table='tbl_x', repl_options={'backend': 'sqlalchemy'})
func_module = <module 'sources.pg_legacy_replication.helpers' from '/home/drnick/repos/github/neuromantik33/verified-sources/sources/pg_legacy_replication/helpers.py'>
is_inner_resource = False
name = 'tbl_x'
resolvable_fields = {}
resource_name = 'tbl_x'
source_section = 'helpers'
spec = None
standalone = False
wrap_standalone = <function resource.<locals>.wrap_standalone at 0x7f96cd7ea8e0>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:731: AttributeError
--------------------------- Captured stdout teardown ---------------------------
schema "src_pl_dataset_202501221235482995" does not exist
schema "src_pl_dataset_202501221235482995_staging" does not exist
___________________ test_core_functionality[pyarrow-duckdb] ____________________
src_config = (<dlt.pipeline.pipeline.Pipeline object at 0x7f96ceb20b10>, 'test_slot_ae9b491f')
destination_name = 'duckdb', backend = 'pyarrow'
@pytest.mark.parametrize("destination_name", ALL_DESTINATIONS)
@pytest.mark.parametrize("backend", ["sqlalchemy", "pyarrow"])
def test_core_functionality(
src_config: Tuple[dlt.Pipeline, str], destination_name: str, backend: TableBackend
) -> None:
@dlt.resource(write_disposition="merge", primary_key="id_x")
def tbl_x(data):
yield data
@dlt.resource(write_disposition="merge", primary_key="id_y")
def tbl_y(data):
yield data
src_pl, slot_name = src_config
src_pl.run(
[
tbl_x({"id_x": 1, "val_x": "foo"}),
tbl_y({"id_y": 1, "val_y": True}),
]
)
add_pk(src_pl.sql_client, "tbl_x", "id_x")
add_pk(src_pl.sql_client, "tbl_y", "id_y")
snapshots = init_replication(
slot_name=slot_name,
schema=src_pl.dataset_name,
table_names=("tbl_x", "tbl_y"),
take_snapshots=True,
table_options={
"tbl_x": {"backend": backend},
"tbl_y": {"backend": backend},
},
)
> changes = replication_source(
slot_name=slot_name,
schema=src_pl.dataset_name,
table_names=("tbl_x", "tbl_y"),
repl_options={
"tbl_x": {"backend": backend},
"tbl_y": {"backend": backend},
},
)
backend = 'pyarrow'
destination_name = 'duckdb'
slot_name = 'test_slot_ae9b491f'
snapshots = <dlt.extract.source.DltSource object at 0x7f96cb9de390>
src_config = (<dlt.pipeline.pipeline.Pipeline object at 0x7f96ceb20b10>, 'test_slot_ae9b491f')
src_pl = <dlt.pipeline.pipeline.Pipeline object at 0x7f96ceb20b10>
tbl_x = <dlt.extract.resource.DltResource object at 0x7f96ccd0c610>
tbl_y = <dlt.extract.resource.DltResource object at 0x7f96ccd0e0d0>
tests/pg_legacy_replication/test_pg_replication.py:65:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:195: in __call__
source = self._deco_f(*args, **kwargs)
args = ()
kwargs = {'repl_options': {'tbl_x': {'backend': 'pyarrow'}, 'tbl_y': {'backend': 'pyarrow'}}, 'schema': 'src_pl_dataset_202501221235515171', 'slot_name': 'test_slot_ae9b491f', 'table_names': ('tbl_x', 'tbl_y')}
self = <dlt.extract.decorators.DltSourceFactoryWrapper object at 0x7f96d7768d90>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:280: in _wrap
return _eval_rv(rv, schema_copy)
_eval_rv = <function DltSourceFactoryWrapper._wrap.<locals>._eval_rv at 0x7f96d773fba0>
_make_schema = <function DltSourceFactoryWrapper._wrap.<locals>._make_schema at 0x7f96cf55c540>
args = ()
conf_f = <function replication_source at 0x7f96cf55c4a0>
kwargs = {'repl_options': {'tbl_x': {'backend': 'pyarrow'}, 'tbl_y': {'backend': 'pyarrow'}}, 'schema': 'src_pl_dataset_202501221235515171', 'slot_name': 'test_slot_ae9b491f', 'table_names': ('tbl_x', 'tbl_y')}
pipeline_name = 'src_pl'
proxy = <dlt.common.pipeline.PipelineContext object at 0x7f96ceb05d10>
rv = <generator object replication_source at 0x7f971204dd00>
schema_copy = Schema replication_source at 140285658134864
source_sections = ('sources', 'pg_legacy_replication', 'replication_source')
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:240: in _eval_rv
_rv = list(_rv)
_rv = <generator object replication_source at 0x7f971204dd00>
schema_copy = Schema replication_source at 140285658134864
self = <dlt.extract.decorators.DltSourceFactoryWrapper object at 0x7f96d7768d90>
source_section = 'pg_legacy_replication'
sources/pg_legacy_replication/__init__.py:118: in replication_source
yield dlt.transformer(
credentials = <dlt.common.configuration.specs.connection_string_credentials.ConnectionStringCredentials object at 0x7f96cb96a0d0>
flush_slot = True
repl_options = defaultdict(<function replication_source.<locals>.<lambda> at 0x7f96cc3dbba0>, {'tbl_x': {'backend': 'pyarrow'}, 'tbl_y': {'backend': 'pyarrow'}})
replication_resource = <function replication_source.<locals>.replication_resource at 0x7f96cc3d85e0>
schema = 'src_pl_dataset_202501221235515171'
slot_name = 'test_slot_ae9b491f'
table = 'tbl_x'
table_names = ('tbl_x', 'tbl_y')
target_batch_size = 1000
wal_reader = <dlt.extract.resource.DltResource object at 0x7f96caff30d0>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:958: in transformer
return resource( # type: ignore
_impl_cls = <class 'dlt.extract.resource.DltResource'>
columns = None
data_from = <dlt.extract.resource.DltResource object at 0x7f96caff30d0>
f = BackendHandler(table='tbl_x', repl_options={'backend': 'pyarrow'})
file_format = None
max_table_nesting = None
merge_key = None
name = 'tbl_x'
parallelized = False
primary_key = None
schema_contract = None
selected = True
spec = None
standalone = False
table_format = None
table_name = None
write_disposition = None
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:758: in resource
return decorator(data)
_impl_cls = <class 'dlt.extract.resource.DltResource'>
columns = None
data = BackendHandler(table='tbl_x', repl_options={'backend': 'pyarrow'})
data_from = <dlt.extract.resource.DltResource object at 0x7f96caff30d0>
decorator = <function resource.<locals>.decorator at 0x7f96cafa7740>
file_format = None
make_resource = <function resource.<locals>.make_resource at 0x7f96cc3d9d00>
max_table_nesting = None
merge_key = None
name = 'tbl_x'
parallelized = False
primary_key = None
references = None
schema_contract = None
selected = True
spec = None
standalone = False
table_format = None
table_name = None
wrap_standalone = <function resource.<locals>.wrap_standalone at 0x7f96cafa6840>
write_disposition = None
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
f = BackendHandler(table='tbl_x', repl_options={'backend': 'pyarrow'})
def decorator(
f: Callable[TResourceFunParams, Any]
) -> Callable[TResourceFunParams, TDltResourceImpl]:
if not callable(f):
if data_from:
# raise more descriptive exception if we construct transformer
raise InvalidTransformerDataTypeGeneratorFunctionRequired(
name or "<no name>", f, type(f)
)
raise ResourceFunctionExpected(name or "<no name>", f, type(f))
if not standalone and callable(name):
raise DynamicNameNotStandaloneResource(get_callable_name(f))
resource_name = name if name and not callable(name) else get_callable_name(f)
func_module = inspect.getmodule(f)
source_section = _get_source_section_name(func_module)
is_inner_resource = is_inner_callable(f)
if spec is None:
# autodetect spec
SPEC, resolvable_fields = spec_from_signature(
f, inspect.signature(f), include_defaults=standalone
)
if is_inner_resource and not standalone:
if len(resolvable_fields) > 0:
# prevent required arguments to inner functions that are not standalone
raise ResourceInnerCallableConfigWrapDisallowed(resource_name, source_section)
else:
# empty spec for inner functions - they should not be injected
SPEC = BaseConfiguration
else:
SPEC = spec
# assign spec to "f"
set_fun_spec(f, SPEC)
# register non inner resources as source with single resource in it
if not is_inner_resource:
# a source function for the source wrapper, args that go to source are forwarded
# to a single resource within
def _source(
name_ovr: str, section_ovr: str, args: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> TDltResourceImpl:
return wrap_standalone(name_ovr or resource_name, section_ovr or source_section, f)(
*args, **kwargs
)
# make the source module same as original resource
> _source.__qualname__ = f.__qualname__
E AttributeError: 'BackendHandler' object has no attribute '__qualname__'
SPEC = <class 'sources.pg_legacy_replication.helpers.BackendHandlerConfiguration'>
_source = <function resource.<locals>.decorator.<locals>._source at 0x7f96cafa76a0>
data_from = <dlt.extract.resource.DltResource object at 0x7f96caff30d0>
f = BackendHandler(table='tbl_x', repl_options={'backend': 'pyarrow'})
func_module = <module 'sources.pg_legacy_replication.helpers' from '/home/drnick/repos/github/neuromantik33/verified-sources/sources/pg_legacy_replication/helpers.py'>
is_inner_resource = False
name = 'tbl_x'
resolvable_fields = {}
resource_name = 'tbl_x'
source_section = 'helpers'
spec = None
standalone = False
wrap_standalone = <function resource.<locals>.wrap_standalone at 0x7f96cafa6840>
.venv/lib/python3.11/site-packages/dlt/extract/decorators.py:731: AttributeError
----------------------------- Captured stderr call -----------------------------
2025-01-22 13:35:54,498|[WARNING]|38699|140286848428928|dlt|source.py|register:572|A source with ref dlt.helpers.tbl_x is already registered and will be overwritten
2025-01-22 13:35:54,513|[WARNING]|38699|140286848428928|dlt|source.py|register:572|A source with ref dlt.helpers.tbl_y is already registered and will be overwritten
--------------------------- Captured stdout teardown ---------------------------
schema "src_pl_dataset_202501221235515171" does not exist
schema "src_pl_dataset_202501221235515171_staging" does not exist
=============================== warnings summary ===============================
.venv/lib/python3.11/site-packages/dlt/helpers/dbt/__init__.py:3
/home/drnick/repos/github/neuromantik33/verified-sources/.venv/lib/python3.11/site-packages/dlt/helpers/dbt/__init__.py:3: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
import pkg_resources
.venv/lib/python3.11/site-packages/dlt/common/configuration/container.py:95: 19 warnings
tests/pg_legacy_replication/test_pg_replication.py: 1958 warnings
/home/drnick/repos/github/neuromantik33/verified-sources/.venv/lib/python3.11/site-packages/dlt/common/configuration/container.py:95: DeprecationWarning: currentThread() is deprecated, use current_thread() instead
if m := re.match(r"dlt-pool-(\d+)-", threading.currentThread().getName()):
.venv/lib/python3.11/site-packages/dlt/common/configuration/container.py:95: 19 warnings
tests/pg_legacy_replication/test_pg_replication.py: 1958 warnings
/home/drnick/repos/github/neuromantik33/verified-sources/.venv/lib/python3.11/site-packages/dlt/common/configuration/container.py:95: DeprecationWarning: getName() is deprecated, get the name attribute instead
if m := re.match(r"dlt-pool-(\d+)-", threading.currentThread().getName()):
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================= slowest 10 durations =============================
2.99s call tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[pyarrow-duckdb]
2.89s call tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[sqlalchemy-duckdb]
0.17s teardown tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[pyarrow-duckdb]
0.17s teardown tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[sqlalchemy-duckdb]
0.07s setup tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[sqlalchemy-duckdb]
0.04s setup tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[pyarrow-duckdb]
=========================== short test summary info ============================
FAILED tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[sqlalchemy-duckdb]
FAILED tests/pg_legacy_replication/test_pg_replication.py::test_core_functionality[pyarrow-duckdb]
======================= 2 failed, 3955 warnings in 6.67s =======================
```
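
In short, the failure comes from the fact that `__qualname__` is an attribute of functions and classes, not of class instances, while dlt's resource decorator copies it from the callable it wraps (`_source.__qualname__ = f.__qualname__` in the traceback above). A minimal sketch, independent of the PR code:

```python
class BackendHandler:
    """Hypothetical stand-in for the callable instance passed to dlt.transformer."""

    def __call__(self, item):
        return item


handler = BackendHandler()
print(BackendHandler.__qualname__)       # "BackendHandler" - the class has it
print(hasattr(handler, "__qualname__"))  # False - the instance does not
# Setting it explicitly, as the FIXME line does, lets code that expects a
# plain function (and reads f.__qualname__) accept the instance as well.
handler.__qualname__ = "BackendHandler.__call__"
print(handler.__qualname__)              # now "BackendHandler.__call__"
```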

```python
    return handler


__all__ = [
    "ReplicationOptions",
    "cleanup_snapshot_resources",
    "init_replication",
    "replication_source",
]
```

@@ -0,0 +1,6 @@
```python
# class SqlDatabaseSourceImportError(Exception):
#     def __init__(self) -> None:
#         super().__init__(
#             "Could not import `sql_database` source. Run `dlt init sql_database <dest>`"
#             " to download the source code."
#         )
```

**Reviewer comment:** What does it mean that `decoderbufs` is optional? If not present, do we decode on the client?

**Author reply:** The heavy lifting is done by the `decoderbufs` extension, which must be added if using a managed Postgres like Cloud SQL, or compiled and installed on a self-hosted Postgres installation.
Detailed instructions can be found here: https://debezium.io/documentation/reference/stable/postgres-plugins.html#logical-decoding-output-plugin-installation
FYI, `decoderbufs` is the default logical decoding plugin used by Debezium.