sql_database: uses full reflection by default (#525)
* skips NULL columns in arrow tables inferred from the data
* uses full reflection level by default
Showing 7 changed files with 43 additions and 93 deletions.
@@ -203,75 +203,3 @@ No issues found. Postgres is the only backend where we observed 2x speedup with
## Learn more

💡 To explore additional customizations for this pipeline, we recommend referring to the official dlt SQL Database verified source documentation. It provides comprehensive information and guidance on how to further customize and tailor the pipeline to suit your specific needs. You can find it in the [Setup Guide: SQL Database](https://dlthub.com/docs/dlt-ecosystem/verified-sources/sql_database).

## Extended configuration

You can configure most of the arguments to `sql_database` and `sql_table` via TOML files and environment variables. This is particularly useful with `sql_table`
because you can maintain a separate configuration for each table. Below we show **secrets.toml** and **config.toml**; you are free to combine them into one file:
```toml
[sources.sql_database]
credentials="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
```
```toml
[sources.sql_database.chat_message]
backend="pandas"
chunk_size=1000

[sources.sql_database.chat_message.incremental]
cursor_path="updated_at"
```
The example above sets up **backend** and **chunk_size** for a table named **chat_message**. It also enables incremental loading on a column named **updated_at**.
The table resource is instantiated as follows:
```python
table = sql_table(table="chat_message", schema="data")
```
Similarly, you can configure the `sql_database` source:

```toml
[sources.sql_database]
credentials="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
```
```toml
[sources.sql_database]
schema="data"
backend="pandas"
chunk_size=1000

[sources.sql_database.chat_message.incremental]
cursor_path="updated_at"
```
Note that we can configure incremental loading per table, even when the table is part of a dlt source. The source below extracts data using the **pandas** backend
with a **chunk_size** of 1000. The **chat_message** table loads data incrementally using the **updated_at** column. All other tables load fully.

```python
database = sql_database()
```
You can configure all the arguments this way (except the adapter callback function). [Standard dlt rules apply](https://dlthub.com/docs/general-usage/credentials/configuration#configure-dlt-sources-and-resources). You can use environment variables [by translating the names properly](https://dlthub.com/docs/general-usage/credentials/config_providers#toml-vs-environment-variables), e.g.:

```sh
SOURCES__SQL_DATABASE__CREDENTIALS="mssql+pyodbc://loader.database.windows.net/dlt_data?trusted_connection=yes&driver=ODBC+Driver+17+for+SQL+Server"
SOURCES__SQL_DATABASE__BACKEND=pandas
SOURCES__SQL_DATABASE__CHUNK_SIZE=1000
SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH=updated_at
```
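The name translation is mechanical: the TOML section path and key are joined with double underscores and uppercased. A tiny illustrative helper (not part of dlt, just a sketch of the rule) makes this concrete:

```python
def to_env_var_name(*path: str) -> str:
    """Translate a TOML config path to a dlt environment variable name.

    Sections and keys are joined with double underscores and uppercased.
    """
    return "__".join(part.upper() for part in path)

# [sources.sql_database] backend="pandas"
print(to_env_var_name("sources", "sql_database", "backend"))
# → SOURCES__SQL_DATABASE__BACKEND

# [sources.sql_database.chat_message.incremental] cursor_path="updated_at"
print(to_env_var_name("sources", "sql_database", "chat_message", "incremental", "cursor_path"))
# → SOURCES__SQL_DATABASE__CHAT_MESSAGE__INCREMENTAL__CURSOR_PATH
```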
### Configuring incremental loading

The `dlt.sources.incremental` class is a [config spec](https://dlthub.com/docs/general-usage/credentials/config_specs) and can be configured like any other spec. Here is an example that sets all possible options:

```toml
[sources.sql_database.chat_message.incremental]
cursor_path="updated_at"
initial_value=2024-05-27T07:32:00Z
end_value=2024-05-28T07:32:00Z
row_order="asc"
allow_external_schedulers=false
```

Please note that we specify the initial and end values as native TOML datetimes. For environment variables, only strings are currently supported.
### Use a SQLAlchemy Engine as credentials

You can pass a **SQLAlchemy** `Engine` instance instead of credentials:

```python
from sqlalchemy import create_engine

engine = create_engine("mysql+pymysql://[email protected]:4497/Rfam")
table = sql_table(engine, table="chat_message", schema="data")
```

The engine is used by `dlt` to open database connections and can work across multiple threads, so it is compatible with the `parallelize` setting of dlt sources and resources.
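The thread-compatibility claim rests on the engine's connection pool: each worker thread checks its own connection out of the pool. A minimal sketch of that behavior (not dlt's internals) using an in-memory SQLite engine, with SQLAlchemy's documented `StaticPool` recipe for sharing such a database across threads; the table name is purely illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

from sqlalchemy import create_engine, text
from sqlalchemy.pool import StaticPool

# In-memory SQLite needs StaticPool + check_same_thread=False to be
# visible from several threads; a server database would not need this.
engine = create_engine(
    "sqlite://",
    connect_args={"check_same_thread": False},
    poolclass=StaticPool,
)

with engine.begin() as conn:
    conn.execute(text("CREATE TABLE chat_message (id INTEGER, content TEXT)"))
    conn.execute(text("INSERT INTO chat_message VALUES (1, 'hi'), (2, 'hello')"))

def count_rows(_: int) -> int:
    # each worker checks a connection out of the engine's pool
    with engine.connect() as conn:
        return conn.execute(text("SELECT count(*) FROM chat_message")).scalar_one()

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(count_rows, range(4)))

# every thread sees the same data through the shared engine
assert results == [2, 2, 2, 2]
```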
@@ -171,6 +171,7 @@ def test_load_sql_schema_loads_all_tables(
        credentials=sql_source_db.credentials,
        schema=sql_source_db.schema,
        backend=backend,
        reflection_level="minimal",
        type_adapter_callback=default_test_callback(destination_name, backend),
    )

@@ -204,6 +205,7 @@ def test_load_sql_schema_loads_all_tables_parallel(
        credentials=sql_source_db.credentials,
        schema=sql_source_db.schema,
        backend=backend,
        reflection_level="minimal",
        type_adapter_callback=default_test_callback(destination_name, backend),
    ).parallelize()

@@ -235,6 +237,7 @@ def test_load_sql_table_names(
        credentials=sql_source_db.credentials,
        schema=sql_source_db.schema,
        table_names=tables,
        reflection_level="minimal",
        backend=backend,
        )
    )

@@ -263,6 +266,7 @@ def make_source():
        credentials=sql_source_db.credentials,
        schema=sql_source_db.schema,
        table_names=tables,
        reflection_level="minimal",
        backend=backend,
    )

@@ -304,6 +308,7 @@ def test_load_mysql_data_load(destination_name: str, backend: TableBackend) -> N
        credentials="mysql+pymysql://[email protected]:4497/Rfam",
        table="family",
        backend=backend,
        reflection_level="minimal",
        backend_kwargs=backend_kwargs,
        # table_adapter_callback=_double_as_decimal_adapter,
    )

@@ -318,6 +323,7 @@ def test_load_mysql_data_load(destination_name: str, backend: TableBackend) -> N
        credentials="mysql+pymysql://[email protected]:4497/Rfam",
        table="family",
        backend=backend,
        reflection_level="minimal",
        # we also try to remove dialect automatically
        backend_kwargs={},
        # table_adapter_callback=_double_as_decimal_adapter,

@@ -341,6 +347,7 @@ def sql_table_source() -> List[DltResource]:
        credentials=sql_source_db.credentials,
        schema=sql_source_db.schema,
        table="chat_message",
        reflection_level="minimal",
        backend=backend,
        )
    ]

@@ -365,6 +372,7 @@ def sql_table_source() -> List[DltResource]:
        schema=sql_source_db.schema,
        table="chat_message",
        incremental=dlt.sources.incremental("updated_at"),
        reflection_level="minimal",
        backend=backend,
        )
    ]

@@ -395,6 +403,7 @@ def sql_table_source() -> List[DltResource]:
        "updated_at",
        sql_source_db.table_infos["chat_message"]["created_at"].start_value,
        ),
        reflection_level="minimal",
        backend=backend,
        )
    ]
@@ -941,17 +950,24 @@ def test_sql_table_from_view(
        credentials=sql_source_db.credentials,
        table="chat_message_view",
        schema=sql_source_db.schema,
        backend=backend,
        # use minimal level so we infer types from DATA
        reflection_level="minimal",
        incremental=dlt.sources.incremental("_created_at")
    )

    pipeline = make_pipeline("duckdb")
    pipeline.run(table)
    info = pipeline.run(table)
    assert_load_info(info)

    assert_row_counts(pipeline, sql_source_db, ["chat_message_view"])
    assert "content" in pipeline.default_schema.tables["chat_message_view"]["columns"]
    assert (
        "content"
        in load_tables_to_dicts(pipeline, "chat_message_view")["chat_message_view"][0]
    )
    assert "_created_at" in pipeline.default_schema.tables["chat_message_view"]["columns"]
    db_data = load_tables_to_dicts(pipeline, "chat_message_view")["chat_message_view"]
    assert "content" in db_data[0]
    assert "_created_at" in db_data[0]
    # make sure the all-NULL column is not dropped from the schema
    assert "_null_ts" in pipeline.default_schema.tables["chat_message_view"]["columns"]

@pytest.mark.parametrize("backend", ["sqlalchemy", "pyarrow", "pandas", "connectorx"])