From d230aff6b4bb9a18a16dd1f12efb5c56cc0740f0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Philippe=20No=C3=ABl?= Date: Mon, 28 Oct 2024 23:46:28 +0000 Subject: [PATCH 1/4] Cleanup --- .github/CODEOWNERS | 2 +- .github/workflows/test-pg_analytics.yml | 3 - CONTRIBUTING.md | 54 ++---- Cargo.lock | 4 +- Cargo.toml | 4 +- README.md | 113 +++++------- sql/pg_analytics--0.2.1--0.2.2.sql | 1 + src/api/mod.rs | 1 - src/api/time_bucket.rs | 58 ------ src/lib.rs | 8 - tests/Cargo.toml | 6 +- tests/README.md | 4 +- tests/tests/datetime.rs | 230 ------------------------ tests/tests/scan.rs | 1 - 14 files changed, 64 insertions(+), 425 deletions(-) create mode 100644 sql/pg_analytics--0.2.1--0.2.2.sql delete mode 100644 src/api/time_bucket.rs diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS index 08c15758..7c50a3f6 100644 --- a/.github/CODEOWNERS +++ b/.github/CODEOWNERS @@ -5,4 +5,4 @@ /assets/ @philippemnoel @rebasedming /sql/ @philippemnoel @rebasedming /src/ @rebasedming @philippemnoel -/tests/ @rebasedming @neilyio +/tests/ @rebasedming @philippemnoel diff --git a/.github/workflows/test-pg_analytics.yml b/.github/workflows/test-pg_analytics.yml index 4e63db3b..d9f6285e 100644 --- a/.github/workflows/test-pg_analytics.yml +++ b/.github/workflows/test-pg_analytics.yml @@ -144,9 +144,6 @@ jobs: LLVM_PROFILE_FILE: target/coverage/pg_analytics-%p-%m.profraw RUST_BACKTRACE: full run: | - # Variables (we disable telemetry to avoid skewing the user metrics with CI runs) - PARADEDB_TELEMETRY=false - echo "" echo "Enabling code coverage..." echo -e "\n# Enable code coverage on Linux only, for CI builds\n[target.'cfg(target_os=\"linux\")']\nrustflags = [\"-Cinstrument-coverage\"]" >> .cargo/config.toml diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index e516cc60..b8ba4336 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,13 +1,10 @@ -# **Contributing to ParadeDB** +# **Contributing to pg_analytics** -Welcome! We're excited that you're interested in contributing to ParadeDB and want to make the process as smooth as possible. +Welcome! We're excited that you're interested in contributing to `pg_analytics` and want to make the process as smooth as possible. ## Technical Info -Before submitting a pull request, please review this document, which outlines what -conventions to follow when submitting changes. If you have any questions not covered -in this document, please reach out to us in the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-217mordsh-ielS6BiZf7VW3rqKBFgAlQ) -or via [email](support@paradedb.com). +Before submitting a pull request, please review this document, which outlines what conventions to follow when submitting changes. If you have any questions not covered in this document, please reach out to us in the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ) or via [email](mailto:support@paradedb.com). ### Claiming GitHub Issues @@ -26,52 +23,27 @@ work on the issue(s) you self-assigned, please use the `unassign me` link at the ### Development Workflow -ParadeDB is structured as a monorepo containing all the projects, PostgreSQL extension(s), and other -tools which together make ParadeDB. For development instructions regarding a specific project or Postgres extension, -please refer to the README in the project's subfolder. For developing ParadeDB itself as the combination -of all its subprojects, please see below. - -All development of ParadeDB is done via Docker and Compose. 
Our Docker setup is split into three: - -- The `docker-compose.dev.yml` file builds our `Dockerfile`, the ParadeDB production image with all its features and extensions enabled. It is used to develop and test ParadeDB Postgres extensions and features as part of the full ParadeDB image. It is also used to develop and test new features and extensions outside of those actively developed by ParadeDB (for instance, installing a new third-party open-source PostgreSQL extension). We recommend using it when developing new features beyond the ParadeDB extensions and subprojects. - -- The `docker-compose.yml` file pulls the latest published ParadeDB image from DockerHub. It is used for hobby production deployments. We recommend using it to deploy ParadeDB in your own infrastructure. +The development of the `pg_analytics` Postgres extension is done via `pgrx`. For detailed development instructions, please refer to the Development section of the README in the extension's subfolder. ### Pull Request Workflow -All changes to ParadeDB happen through Github Pull Requests. Here is the recommended +All changes to `pg_analytics` happen through GitHub Pull Requests. Here is the recommended flow for making a change: -1. Before working on a change, please check to see if there is already a GitHub - issue open for that change. -2. If there is not, please open an issue first. This gives the community visibility - into what you're working on and allows others to make suggestions and leave comments. -3. Fork the ParadeDB repo and branch out from the `dev` branch. -4. Install pre-commit hooks within your fork with `pre-commit install`, to ensure code quality and consistency with upstream. -5. Make your changes. If you've added new functionality, please add tests. -6. Open a pull request towards the `dev` branch. Ensure that all tests and checks - pass. Note that the ParadeDB repository has pull request title linting in place - and follows the [Conventional Commits spec](https://github.com/amannn/action-semantic-pull-request). +1. Before working on a change, please check to see if there is already a GitHub issue open for that change. +2. If there is not, please open an issue first. This gives the community visibility into what you're working on and allows others to make suggestions and leave comments. +3. Fork the `pg_analytics` repo and branch out from the `dev` branch. +4. Install [pre-commit](https://pre-commit.com/) hooks within your fork with `pre-commit install` to ensure code quality and consistency with upstream. +5. Make your changes. If you've added new functionality, please add tests. We will not merge a feature without appropriate tests. +6. Open a pull request towards the `dev` branch. Ensure that all tests and checks pass. Note that the `pg_analytics` repository has pull request title linting in place and follows the [Conventional Commits spec](https://github.com/amannn/action-semantic-pull-request). 7. Congratulations! Our team will review your pull request. ### Documentation -ParadeDB's public-facing documentation is stored in the `docs` folder. If you are -adding a new feature that requires new documentation, please open a separate pull -request containing changes to the documentation only. Once your main pull request -is merged, the ParadeDB team will review and eventually merge your documentation -changes as well. +The public-facing documentation for `pg_analytics` is written directly in the README. 
If you are adding a new feature that requires new documentation, please add the documentation as part of your pull request. We will not merge a feature without appropriate documentation. ## Legal Info -### Contributor License Agreement - -In order for us, Retake, Inc. (dba ParadeDB) to accept patches and other contributions from you, you need to adopt our ParadeDB Contributor License Agreement (the "**CLA**"). The current version of the CLA can be found [here](https://cla-assistant.io/paradedb/paradedb). - -ParadeDB uses a tool called CLA Assistant to help us keep track of the CLA status of contributors. CLA Assistant will post a comment to your pull request, indicating whether you have signed the CLA or not. If you have not signed the CLA, you will need to do so before we can accept your contribution. Signing the CLA is a one-time process, is valid for all future contributions to ParadeDB, and can be done in under a minute by signing in with your GitHub account. - -If you have any questions about the CLA, please reach out to us in the [ParadeDB Community Slack](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-217mordsh-ielS6BiZf7VW3rqKBFgAlQ) or via email at [legal@paradedb.com](mailto:legal@paradedb.com). - ### License -By contributing to ParadeDB, you agree that your contributions will be licensed under the [GNU Affero General Public License v3.0](LICENSE). +By contributing to `pg_analytics`, you agree that your contributions will be licensed under the [PostgreSQL License](LICENSE). diff --git a/Cargo.lock b/Cargo.lock index c9381f55..74a0fa3e 100644 --- a/Cargo.lock +++ b/Cargo.lock @@ -3898,7 +3898,7 @@ dependencies = [ [[package]] name = "pg_analytics" -version = "0.2.1" +version = "0.2.2" dependencies = [ "anyhow", "async-std", @@ -5581,7 +5581,7 @@ dependencies = [ [[package]] name = "tests" -version = "0.2.1" +version = "0.2.2" dependencies = [ "anyhow", "async-std", diff --git a/Cargo.toml b/Cargo.toml index 2812ef34..85222585 100644 --- a/Cargo.toml +++ b/Cargo.toml @@ -1,9 +1,9 @@ [package] name = "pg_analytics" description = "Postgres for analytics, powered by DuckDB" -version = "0.2.1" +version = "0.2.2" edition = "2021" -license = "AGPL-3.0" +license = "PostgreSQL" [lib] crate-type = ["cdylib", "rlib"] diff --git a/README.md b/README.md index 2f53be11..d8bf5540 100644 --- a/README.md +++ b/README.md @@ -3,17 +3,22 @@
-[![Test pg_analytics](https://github.com/paradedb/pg_analytics/actions/workflows/test-pg_analytics.yml/badge.svg)](https://github.com/paradedb/pg_analytics/actions/workflows/test-pg_analytics.yml) +[![Publish pg_analytics](https://github.com/paradedb/pg_analytics/actions/workflows/publish-pg_analytics.yml/badge.svg)](https://github.com/paradedb/pg_analytics/actions/workflows/publish-pg_analytics.yml) +[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/paradedb)](https://artifacthub.io/packages/search?repo=paradedb) +[![Docker Pulls](https://img.shields.io/docker/pulls/paradedb/paradedb)](https://hub.docker.com/r/paradedb/paradedb) +[![License](https://img.shields.io/github/license/paradedb/paradedb?color=blue)](https://github.com/paradedb/pg_analytics?tab=PostgreSQL-1-ov-file#readme) +[![Slack URL](https://img.shields.io/badge/Join%20Slack-purple?logo=slack&link=https%3A%2F%2Fjoin.slack.com%2Ft%2Fparadedbcommunity%2Fshared_invite%2Fzt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ)](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ) +[![X URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb)](https://x.com/paradedb) ## Overview -`pg_analytics` (formerly named `pg_lakehouse`) puts DuckDB inside Postgres. +`pg_analytics` (formerly named `pg_lakehouse`) puts DuckDB inside Postgres. With `pg_analytics` installed, Postgres can query foreign object stores like AWS S3 and table formats like Iceberg or Delta Lake. Queries are pushed down to DuckDB, a high performance analytical query engine. -With `pg_analytics` installed, Postgres can query foreign object stores like S3 and table formats like Iceberg or Delta Lake. Queries are pushed down to DuckDB, a high performance analytical query engine. +`pg_analytics` uses DuckDB v1.0.0 and is supported on Postgres 13+. ### Motivation -Today, a vast amount of non-operational data — events, metrics, historical snapshots, vendor data, etc. — is ingested into data lakes like S3. Querying this data by moving it into a cloud data warehouse or operating a new query engine is expensive and time-consuming. The goal of `pg_analytics` is to enable this data to be queried directly from Postgres. This eliminates the need for new infrastructure, loss of data freshness, data movement, and non-Postgres dialects of other query engines. +Today, a vast amount of non-operational data — events, metrics, historical snapshots, vendor data, etc. — is ingested into data lakes like AWS S3. Querying this data by moving it into a cloud data warehouse or operating a new query engine is expensive and time-consuming. The goal of `pg_analytics` is to enable this data to be queried directly from Postgres. This eliminates the need for new infrastructure, loss of data freshness, data movement, and non-Postgres dialects of other query engines. `pg_analytics` uses the foreign data wrapper (FDW) API to connect to any object store or table format and the executor hook API to push queries to DuckDB. 
While other FDWs like `aws_s3` have existed in the Postgres extension ecosystem, these FDWs suffer from two limitations: @@ -24,31 +29,33 @@ Today, a vast amount of non-operational data — events, metrics, historical sna ### Roadmap -- [ ] Read support for `pg_analytics` -- [ ] Write support for `pg_analytics` -- [ ] `EXPLAIN` support -- [ ] Automatic schema detection +- [x] Read support for `pg_analytics` +- [ ] (In progress) Write support for `pg_analytics` +- [x] `EXPLAIN` support +- [x] `VIEW` support +- [x] Automatic schema detection - [ ] Integration with the catalog providers #### Object Stores -- [x] Amazon S3 +- [x] AWS S3 - [x] S3-compatible stores (MinIO, R2) +- [x] Google Cloud Storage - [x] Azure Blob Storage - [x] Azure Data Lake Storage Gen2 -- [x] Google Cloud Storage +- [x] HuggingFace - [x] HTTP server - [x] Local file system -#### Table Formats +#### File/Table Formats - [x] Parquet - [x] CSV -- [x] Apache Iceberg -- [x] Delta Lake - [x] JSON - -`pg_analytics` uses DuckDB v1.0.0 and is supported on Postgres 17, 16, 15, 14 and 13. +- [x] Geospatial (`.geojson`, `.xlsx`) +- [x] Delta Lake +- [x] Apache Iceberg +- [ ] Apache Hudi ## Installation @@ -57,49 +64,30 @@ Today, a vast amount of non-operational data — events, metrics, historical sna The easiest way to use the extension is to run the ParadeDB Dockerfile: ```bash -docker run \ - --name paradedb \ - -e POSTGRESQL_USERNAME= \ - -e POSTGRESQL_PASSWORD= \ - -e POSTGRESQL_DATABASE= \ - -e POSTGRESQL_POSTGRES_PASSWORD= \ - -v paradedb_data:/bitnami/postgresql \ - -p 5432:5432 \ - -d \ - paradedb/paradedb:latest +docker run --name paradedb -e POSTGRES_PASSWORD=password paradedb/paradedb +docker exec -it paradedb psql -U postgres ``` -This will spin up a Postgres instance with `pg_analytics` preinstalled. +This will spin up a PostgreSQL 16 instance with `pg_analytics` preinstalled. ### From Self-Hosted PostgreSQL -If you are self-hosting Postgres and would like to use the extension within your existing Postgres, follow the steps below. - -It's **very important** to make the following change to your `postgresql.conf` configuration file. `pg_analytics` must be in the list of `shared_preload_libraries`: +Because this extension uses Postgres hooks to intercept and push queries down to DuckDB, it is **very important** that it is added to `shared_preload_libraries` inside `postgresql.conf`. -```c +```bash +# Inside postgresql.conf shared_preload_libraries = 'pg_analytics' ``` -This ensures the best query performance from the extension . +This ensures the best query performance from the extension. -#### Debian/Ubuntu +#### Linux -We provide prebuilt binaries for Debian-based Linux for Postgres 17, 16, 15, 14 and 13. You can download the latest version for your architecture from the [releases page](https://github.com/paradedb/paradedb/releases). - -ParadeDB collects anonymous telemetry to help us understand how many people are using the project. You can opt out of telemetry by setting `export PARADEDB_TELEMETRY=false` (or unsetting the variable) in your shell or in your `~/.bashrc` file before running the extension. +We provide prebuilt binaries for Debian, Ubuntu, and Red Hat Enterprise Linux for Postgres 14+. You can download the latest version for your architecture from the [GitHub Releases page](https://github.com/paradedb/paradedb/releases). #### macOS -We don't suggest running production workloads on macOS. As a result, we don't provide prebuilt binaries for macOS. 
If you are running Postgres on macOS and want to install `pg_analytics`, please follow the [development](#development) instructions, but do `cargo pgrx install --release` instead of `cargo pgrx run`. This will build the extension from source and install it in your Postgres instance. - -You can then create the extension in your database by running: - -```sql -CREATE EXTENSION pg_analytics; -``` - -Note: If you are using a managed Postgres service like Amazon RDS, you will not be able to install `pg_analytics` until the Postgres service explicitly supports it. +At this time, we do not provide prebuilt binaries for macOS. If you are running Postgres on macOS and want to install `pg_analytics`, please follow the [development](#development) instructions, replacing `cargo pgrx run` by `cargo pgrx install --release`. This will build the extension from source and install it in your macOS Postgres instance (e.g. Homebrew). #### Windows @@ -129,16 +117,9 @@ SELECT COUNT(*) FROM trips; (1 row) ``` -To query your own data, please refer to the [documentation](https://docs.paradedb.com/analytics/object_stores). - -## Shared Preload Libraries +## Documentation -Because this extension uses Postgres hooks to intercept and push queries down to DuckDB, it is **very important** that it is added to `shared_preload_libraries` inside `postgresql.conf`. - -```bash -# Inside postgresql.conf -shared_preload_libraries = 'pg_analytics' -``` +TODO: Add documentation breakdown here ## Development @@ -155,9 +136,7 @@ rustup default Note: While it is possible to install Rust via your package manager, we recommend using `rustup` as we've observed inconsistencies with Homebrew's Rust installation on macOS. -Then, install the PostgreSQL version of your choice using your system package manager. Here we provide the commands for the default PostgreSQL version used by this project: - -### Install Other Dependencies +### Install Dependencies Before compiling the extension, you'll need to have the following dependencies installed. @@ -166,10 +145,10 @@ Before compiling the extension, you'll need to have the following dependencies i brew install make gcc pkg-config openssl # Ubuntu -sudo apt-get install -y make gcc pkg-config libssl-dev +sudo apt-get install -y make gcc pkg-config libssl-dev libclang-dev # Arch Linux -sudo pacman -S core/openssl +sudo pacman -S core/openssl extra/clang ``` ### Install Postgres @@ -198,7 +177,7 @@ export PATH="$PATH:/Applications/Postgres.app/Contents/Versions/latest/bin" Then, install and initialize `pgrx`: ```bash -# Note: Replace --pg17 with your version of Postgres, if different (i.e. --pg17, --pg16, --pg15, --pg14, --pg13 etc.) +# Note: Replace --pg17 with your version of Postgres, if different (i.e. --pg16) cargo install --locked cargo-pgrx --version 0.12.6 # macOS arm64 @@ -216,18 +195,6 @@ cargo pgrx init --pg17=/usr/bin/pg_config If you prefer to use a different version of Postgres, update the `--pg` flag accordingly. -Note: While it is possible to develop using pgrx's own Postgres installation(s), via `cargo pgrx init` without specifying a `pg_config` path, we recommend using your system package manager's Postgres as we've observed inconsistent behaviours when using pgrx's. - -`pgrx` requires `libclang`. To install it: - -```bash -# Ubuntu -sudo apt install libclang-dev - -# Arch Linux -sudo pacman -S extra/clang -``` - ### Running the Extension First, start pgrx: @@ -242,7 +209,7 @@ This will launch an interactive connection to Postgres. 
Inside Postgres, create CREATE EXTENSION pg_analytics; ``` -Now, you have access to all the extension functions. +You now have access to all the extension functions. ### Modifying the Extension @@ -273,4 +240,4 @@ DATABASE_URL=postgres://@:/ ## License -`pg_analytics` is licensed under the [PostgreSQL License](https://www.postgresql.org/about/licence/) and as commercial software. For commercial licensing, please contact us at [sales@paradedb.com](mailto:sales@paradedb.com). +`pg_analytics` is licensed under the [PostgreSQL License](https://www.postgresql.org/about/licence/). diff --git a/sql/pg_analytics--0.2.1--0.2.2.sql b/sql/pg_analytics--0.2.1--0.2.2.sql new file mode 100644 index 00000000..0adbaa09 --- /dev/null +++ b/sql/pg_analytics--0.2.1--0.2.2.sql @@ -0,0 +1 @@ +\echo Use "ALTER EXTENSION pg_analytics UPDATE TO '0.2.2'" to load this file. \quit diff --git a/src/api/mod.rs b/src/api/mod.rs index 595b4f4b..2cdb683c 100644 --- a/src/api/mod.rs +++ b/src/api/mod.rs @@ -18,4 +18,3 @@ mod csv; mod duckdb; mod parquet; -pub mod time_bucket; diff --git a/src/api/time_bucket.rs b/src/api/time_bucket.rs deleted file mode 100644 index 399c57a4..00000000 --- a/src/api/time_bucket.rs +++ /dev/null @@ -1,58 +0,0 @@ -// Copyright (c) 2023-2024 Retake, Inc. -// -// This file is part of ParadeDB - Postgres for Search and Analytics -// -// This program is free software: you can redistribute it and/or modify -// it under the terms of the GNU Affero General Public License as published by -// the Free Software Foundation, either version 3 of the License, or -// (at your option) any later version. -// -// This program is distributed in the hope that it will be useful -// but WITHOUT ANY WARRANTY; without even the implied warranty of -// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -// GNU Affero General Public License for more details. -// -// You should have received a copy of the GNU Affero General Public License -// along with this program. If not, see . - -use pgrx::{datum::*, pg_extern}; - -const TIME_BUCKET_FALLBACK_ERROR: &str = "Function `time_bucket()` must be used with a DuckDB FDW. Native postgres does not support this function. 
If you believe this function should be implemented natively as a fallback please submit a ticket to https://github.com/paradedb/pg_analytics/issues."; - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_date(_bucket_width: Interval, _input: Date) -> Date { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_date_origin(_bucket_width: Interval, _input: Date, _origin: Date) -> Date { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_date_offset(_bucket_width: Interval, _input: Date, _offset: Interval) -> Date { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_timestamp(_bucket_width: Interval, _input: Timestamp) -> Timestamp { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_timestamp_offset_date( - _bucket_width: Interval, - _input: Timestamp, - _origin: Date, -) -> Timestamp { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} - -#[pg_extern(name = "time_bucket")] -pub fn time_bucket_timestamp_offset_interval( - _bucket_width: Interval, - _input: Timestamp, - _offset: Interval, -) -> Timestamp { - panic!("{}", TIME_BUCKET_FALLBACK_ERROR); -} diff --git a/src/lib.rs b/src/lib.rs index 5e7d5654..af9cc566 100644 --- a/src/lib.rs +++ b/src/lib.rs @@ -28,10 +28,6 @@ use crate::debug_guc::DebugGucSettings; use hooks::ExtensionHook; use pgrx::*; -// TODO: Reactivate once we've properly integrated with the monorepo -// A static variable is required to host grand unified configuration settings. -// pub static GUCS: PostgresGlobalGucSettings = PostgresGlobalGucSettings::new(); - #[cfg(debug_assertions)] pub static DEBUG_GUCS: DebugGucSettings = DebugGucSettings::new(); @@ -47,10 +43,6 @@ pub extern "C" fn _PG_init() { register_hook(&mut EXTENSION_HOOK) }; - // TODO: Depends on above TODO - // GUCS.init("pg_analytics"); - // setup_telemetry_background_worker(ParadeExtension::PgAnalytics); - #[cfg(debug_assertions)] DEBUG_GUCS.init(); } diff --git a/tests/Cargo.toml b/tests/Cargo.toml index 591e4937..0c799b97 100644 --- a/tests/Cargo.toml +++ b/tests/Cargo.toml @@ -1,9 +1,9 @@ [package] name = "tests" -description = "test suite for pg_analytics" -version = "0.2.1" +description = "Test suite for pg_analytics" +version = "0.2.2" edition = "2021" -license = "AGPL-3.0" +license = "PostgreSQL" [lib] crate-type = ["rlib"] diff --git a/tests/README.md b/tests/README.md index f8340ec8..d849f9bb 100644 --- a/tests/README.md +++ b/tests/README.md @@ -1,11 +1,11 @@ -# Test suite for `pg_analytics +# Test suite for pg_analytics This is the test suite for the `pg_analytics` extension. An example of doing all that's necessary to run the tests is, from the root of the repo is: ```shell -#! 
/bin/sh +#!/bin/bash set -x export DATABASE_URL=postgresql://localhost:28816/pg_analytics diff --git a/tests/tests/datetime.rs b/tests/tests/datetime.rs index 0e14a35d..fb63de61 100644 --- a/tests/tests/datetime.rs +++ b/tests/tests/datetime.rs @@ -37,236 +37,6 @@ use tempfile::TempDir; use time::macros::datetime; use time::{Date, Month::January, PrimitiveDateTime}; -#[rstest] -async fn test_time_bucket_minutes_duckdb(mut conn: PgConnection, tempdir: TempDir) -> Result<()> { - let stored_batch = time_series_record_batch_minutes()?; - let parquet_path = tempdir.path().join("test_arrow_types.parquet"); - let parquet_file = File::create(&parquet_path)?; - - let mut writer = ArrowWriter::try_new(parquet_file, stored_batch.schema(), None).unwrap(); - writer.write(&stored_batch)?; - writer.close()?; - - primitive_setup_fdw_local_file_listing(parquet_path.as_path().to_str().unwrap(), "MyTable") - .execute(&mut conn); - - format!( - "CREATE FOREIGN TABLE timeseries () SERVER parquet_server OPTIONS (files '{}')", - parquet_path.to_str().unwrap() - ) - .execute(&mut conn); - - match "SELECT time_bucket(INTERVAL '2 DAY') AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;".execute_result(&mut conn) { - Ok(_) => { - panic!( - "should have failed call to time_bucket() for timeseries data with incorrect parameters" - ); - } - Err(err) => { - assert_eq!("error returned from database: function time_bucket(interval) does not exist", err.to_string()); - } - } - - let data: Vec<(NaiveDateTime, BigDecimal)> = "SELECT time_bucket(INTERVAL '10 MINUTE', timestamp::TIMESTAMP) AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - - assert_eq!(2, data.len()); - - let expected: Vec<(NaiveDateTime, BigDecimal)> = vec![ - ("1970-01-01T00:00:00".parse()?, BigDecimal::from_str("3")?), - ("1970-01-01T00:10:00".parse()?, BigDecimal::from_str("8")?), - ]; - assert_eq!(expected, data); - - let data: Vec<(NaiveDateTime, BigDecimal)> = "SELECT time_bucket(INTERVAL '1 MINUTE', timestamp::TIMESTAMP) AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - - assert_eq!(10, data.len()); - - let expected: Vec<(NaiveDateTime, BigDecimal)> = vec![ - ("1970-01-01T00:01:00".parse()?, BigDecimal::from_str("1")?), - ("1970-01-01T00:02:00".parse()?, BigDecimal::from_str("-1")?), - ("1970-01-01T00:03:00".parse()?, BigDecimal::from_str("0")?), - ("1970-01-01T00:04:00".parse()?, BigDecimal::from_str("2")?), - ("1970-01-01T00:05:00".parse()?, BigDecimal::from_str("3")?), - ("1970-01-01T00:06:00".parse()?, BigDecimal::from_str("4")?), - ("1970-01-01T00:07:00".parse()?, BigDecimal::from_str("5")?), - ("1970-01-01T00:08:00".parse()?, BigDecimal::from_str("6")?), - ("1970-01-01T00:09:00".parse()?, BigDecimal::from_str("7")?), - ("1970-01-01T00:10:00".parse()?, BigDecimal::from_str("8")?), - ]; - assert_eq!(expected, data); - - let data: Vec<(NaiveDateTime, BigDecimal)> = "SELECT time_bucket(INTERVAL '10 MINUTE', timestamp::TIMESTAMP, INTERVAL '5 MINUTE') AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - assert_eq!(2, data.len()); - - let expected: Vec<(NaiveDateTime, BigDecimal)> = vec![ - ( - "1969-12-31T23:55:00".parse()?, - BigDecimal::from_str("0.5000")?, - ), - ( - "1970-01-01T00:05:00".parse()?, - BigDecimal::from_str("5.5000")?, - ), - ]; - assert_eq!(expected, data); - - Ok(()) -} - -#[rstest] 
-async fn test_time_bucket_years_duckdb(mut conn: PgConnection, tempdir: TempDir) -> Result<()> { - let stored_batch = time_series_record_batch_years()?; - let parquet_path = tempdir.path().join("test_arrow_types.parquet"); - let parquet_file = File::create(&parquet_path)?; - - let mut writer = ArrowWriter::try_new(parquet_file, stored_batch.schema(), None).unwrap(); - writer.write(&stored_batch)?; - writer.close()?; - - primitive_setup_fdw_local_file_listing(parquet_path.as_path().to_str().unwrap(), "MyTable") - .execute(&mut conn); - - format!( - "CREATE FOREIGN TABLE timeseries () SERVER parquet_server OPTIONS (files '{}')", - parquet_path.to_str().unwrap() - ) - .execute(&mut conn); - - match "SELECT time_bucket(INTERVAL '2 DAY') AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;".execute_result(&mut conn) { - Ok(_) => { - panic!( - "should have failed call to time_bucket() for timeseries data with incorrect parameters" - ); - } - Err(err) => { - assert_eq!("error returned from database: function time_bucket(interval) does not exist", err.to_string()); - } - } - - let data: Vec<(Date, BigDecimal)> = "SELECT time_bucket(INTERVAL '1 YEAR', timestamp::DATE) AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - - assert_eq!(10, data.len()); - - let expected: Vec<(Date, BigDecimal)> = vec![ - ( - Date::from_calendar_date(1970, January, 1)?, - BigDecimal::from_str("1")?, - ), - ( - Date::from_calendar_date(1971, January, 1)?, - BigDecimal::from_str("-1")?, - ), - ( - Date::from_calendar_date(1972, January, 1)?, - BigDecimal::from_str("0")?, - ), - ( - Date::from_calendar_date(1973, January, 1)?, - BigDecimal::from_str("2")?, - ), - ( - Date::from_calendar_date(1974, January, 1)?, - BigDecimal::from_str("3")?, - ), - ( - Date::from_calendar_date(1975, January, 1)?, - BigDecimal::from_str("4")?, - ), - ( - Date::from_calendar_date(1976, January, 1)?, - BigDecimal::from_str("5")?, - ), - ( - Date::from_calendar_date(1977, January, 1)?, - BigDecimal::from_str("6")?, - ), - ( - Date::from_calendar_date(1978, January, 1)?, - BigDecimal::from_str("7")?, - ), - ( - Date::from_calendar_date(1979, January, 1)?, - BigDecimal::from_str("8")?, - ), - ]; - assert_eq!(expected, data); - - let data: Vec<(Date, BigDecimal)> = "SELECT time_bucket(INTERVAL '5 YEAR', timestamp::DATE) AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - - assert_eq!(2, data.len()); - - let expected: Vec<(Date, BigDecimal)> = vec![ - ( - Date::from_calendar_date(1970, January, 1)?, - BigDecimal::from_str("1")?, - ), - ( - Date::from_calendar_date(1975, January, 1)?, - BigDecimal::from_str("6")?, - ), - ]; - assert_eq!(expected, data); - - let data: Vec<(Date, BigDecimal)> = "SELECT time_bucket(INTERVAL '2 YEAR', timestamp::DATE, DATE '1969-01-01') AS bucket, AVG(value) as avg_value FROM timeseries GROUP BY bucket ORDER BY bucket;" - .fetch_result(&mut conn).unwrap(); - - assert_eq!(6, data.len()); - - let expected: Vec<(Date, BigDecimal)> = vec![ - ( - Date::from_calendar_date(1969, January, 1)?, - BigDecimal::from_str("1")?, - ), - ( - Date::from_calendar_date(1971, January, 1)?, - BigDecimal::from_str("-0.5000")?, - ), - ( - Date::from_calendar_date(1973, January, 1)?, - BigDecimal::from_str("2.5000")?, - ), - ( - Date::from_calendar_date(1975, January, 1)?, - BigDecimal::from_str("4.5000")?, - ), - ( - Date::from_calendar_date(1977, January, 1)?, - 
BigDecimal::from_str("6.5000")?, - ), - ( - Date::from_calendar_date(1979, January, 1)?, - BigDecimal::from_str("8")?, - ), - ]; - assert_eq!(expected, data); - - Ok(()) -} - -#[rstest] -async fn test_time_bucket_fallback(mut conn: PgConnection) -> Result<()> { - let error_message = "Function `time_bucket()` must be used with a DuckDB FDW. Native postgres does not support this function. If you believe this function should be implemented natively as a fallback please submit a ticket to https://github.com/paradedb/pg_analytics/issues."; - let trips_table = NycTripsTable::setup(); - trips_table.execute(&mut conn); - - match "SELECT time_bucket(INTERVAL '2 DAY', tpep_pickup_datetime::DATE) AS bucket, AVG(trip_distance) as avg_value FROM nyc_trips GROUP BY bucket ORDER BY bucket;".execute_result(&mut conn) { - Ok(_) => { - panic!("Should have error'ed when calling time_bucket() on non-FDW data.") - } - Err(error) => { - let a = error.to_string().contains(error_message); - assert!(a); - } - } - - Ok(()) -} - #[rstest] async fn test_date_trunc( mut conn: PgConnection, diff --git a/tests/tests/scan.rs b/tests/tests/scan.rs index decc9934..cdd71139 100644 --- a/tests/tests/scan.rs +++ b/tests/tests/scan.rs @@ -478,7 +478,6 @@ async fn test_complex_quals_pushdown(mut conn: PgConnection, tempdir: TempDir) - // make sure the result is correct with complex clauses. let rows: Vec<(i64,)> = query.fetch(&mut conn); - // TODO: check the plan. Wrappers not parse quals correctly. So there is not qual pushdown assert!( rows.len() == 2, "result error: rows length: {}\nquery: {}\n", From 1b484f99f911a00cf944a24192a94493a0c8b17e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Philippe=20No=C3=ABl?= Date: Mon, 28 Oct 2024 23:54:53 +0000 Subject: [PATCH 2/4] Cleanup --- README.md | 2 +- docs/configuration/schema.mdx | 84 +++++++ docs/configuration/settings.mdx | 23 ++ docs/formats/csv.mdx | 331 +++++++++++++++++++++++++++ docs/formats/delta.mdx | 56 +++++ docs/formats/iceberg.mdx | 86 +++++++ docs/formats/json.mdx | 356 +++++++++++++++++++++++++++++ docs/formats/parquet.mdx | 175 ++++++++++++++ docs/formats/spatial.mdx | 116 ++++++++++ docs/object_stores/azure.mdx | 168 ++++++++++++++ docs/object_stores/gcs.mdx | 50 ++++ docs/object_stores/huggingface.mdx | 63 +++++ docs/object_stores/s3.mdx | 103 +++++++++ docs/overview.mdx | 61 +++++ tests/README.md | 3 + 15 files changed, 1676 insertions(+), 1 deletion(-) create mode 100644 docs/configuration/schema.mdx create mode 100644 docs/configuration/settings.mdx create mode 100644 docs/formats/csv.mdx create mode 100644 docs/formats/delta.mdx create mode 100644 docs/formats/iceberg.mdx create mode 100644 docs/formats/json.mdx create mode 100644 docs/formats/parquet.mdx create mode 100644 docs/formats/spatial.mdx create mode 100644 docs/object_stores/azure.mdx create mode 100644 docs/object_stores/gcs.mdx create mode 100644 docs/object_stores/huggingface.mdx create mode 100644 docs/object_stores/s3.mdx create mode 100644 docs/overview.mdx diff --git a/README.md b/README.md index d8bf5540..406091f6 100644 --- a/README.md +++ b/README.md @@ -119,7 +119,7 @@ SELECT COUNT(*) FROM trips; ## Documentation -TODO: Add documentation breakdown here +Complete documentation for `pg_analytics` can be found under the [docs](/docs/) folder as Markdown files. It covers how to query the various object stores and file and table formats supports, and how to configure and tune the extension. 
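+
+As a quick example of the kind of configuration the docs cover, the sketch below is adapted from `docs/configuration/schema.mdx`. It uses the `select` table option to map and rename a subset of columns over the same S3 Parquet file as the Usage section above, and assumes the `parquet_server` foreign server from that section already exists; the table name is only illustrative.
+
+```sql
+-- Map a subset of columns from the underlying Parquet file, renaming them along the way
+CREATE FOREIGN TABLE trips_subset ()
+SERVER parquet_server
+OPTIONS (
+  files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet',
+  select 'vendorid AS vendor_id, passenger_count AS passengers'
+);
+
+-- The renamed columns can then be queried like any other Postgres columns
+SELECT vendor_id, passengers FROM trips_subset LIMIT 1;
+```
+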
## Development diff --git a/docs/configuration/schema.mdx b/docs/configuration/schema.mdx new file mode 100644 index 00000000..195b2e8b --- /dev/null +++ b/docs/configuration/schema.mdx @@ -0,0 +1,84 @@ +--- +title: Foreign Table Schema +--- + +## Auto Schema Creation + +If no columns are specified in `CREATE FOREIGN TABLE`, the appropriate Postgres schema will +automatically be created. + +```sql +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet'); +``` + +## Configure Columns + +The `select` option can be used to configure the columns mapped over the underlying file(s). This is useful for renaming, modifying, or +generating additional columns. `select` takes any string that can be passed to a SQL `SELECT` statement. By default, it is set to `*`, +which selects all columns as-is. + +```sql +-- Only use a subset of columns +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS ( + files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', + select 'vendorid, passenger_count' +); + +-- Rename columns +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS ( + files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', + select 'vendorid AS vendor_id, passenger_count AS passengers' +); + +-- Generate additional columns +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS ( + files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', + select '*, 2024 AS year, 1 AS month' +); + +-- Modify existing column +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS ( + files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', + select '(vendorid + 1) AS vendorid' +); +``` + +## Preserve Casing + +Whereas DuckDB preserves the casing of identifiers like column names by default, Postgres does not. +In Postgres, identifiers are automatically lowercased unless wrapped in double quotation marks. + +```sql Postgres +-- The following two statements are equivalent +CREATE TABLE MyTable (MyColumn a); +CREATE TABLE mytable (mycolumn a); + +-- Double quotes must be used to preserve casing +CREATE TABLE "MyTable" ("MyColumn" a); +``` + +By default, auto schema creation will create column names in lowercase. This can be +changed with the `preserve_casing` option, which tells auto schema creation to wrap column names in double +quotes. + +```sql +CREATE FOREIGN TABLE trips () +SERVER parquet_server +OPTIONS ( + files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet', + preserve_casing 'true' +); + +-- Columns are now case-sensitive +SELECT "RatecodeID" FROM trips LIMIT 1; +``` diff --git a/docs/configuration/settings.mdx b/docs/configuration/settings.mdx new file mode 100644 index 00000000..9ad8194a --- /dev/null +++ b/docs/configuration/settings.mdx @@ -0,0 +1,23 @@ +--- +title: DuckDB Settings +--- + +The `duckdb_execute` function can be used to change the underlying [configuration options](https://duckdb.org/docs/configuration/overview#global-configuration-options) +used by DuckDB. + +```sql +-- Dollar quoted strings are used to escape single quotes +SELECT duckdb_execute($$SET memory_limit='10GiB'$$); +``` + +`duckdb_settings` returns a table of all available settings. + +```sql +SELECT * FROM duckdb_settings(); +``` + + + Because a new DuckDB connection is created per Postgres connection, every new + Postgres connection uses the default DuckDB configuration. Changes to the + DuckDB configuration only apply to the current Postgres connection. 
+ diff --git a/docs/formats/csv.mdx b/docs/formats/csv.mdx new file mode 100644 index 00000000..e1e83c17 --- /dev/null +++ b/docs/formats/csv.mdx @@ -0,0 +1,331 @@ +--- +title: CSV +--- + +## Overview + +This code block demonstrates how to query CSV file(s). + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER csv_fdw_handler +VALIDATOR csv_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER csv_wrapper +HANDLER csv_fdw_handler +VALIDATOR csv_fdw_validator; + +CREATE SERVER csv_server +FOREIGN DATA WRAPPER csv_wrapper; + +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS (files 's3://bucket/folder/file.csv'); + +```` + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. + + +The path of a single CSV file or [multiple CSV files](#multiple-csv-files). +For instance, `s3://bucket/folder/file.csv` if the file is in Amazon S3, `https://domain.tld/file.csv` +if the file is on a HTTP server, or `/path/to/file.csv` if the file is on the local file system. + + +## CSV Options + +There are a number of options that can be passed into the `CREATE FOREIGN TABLE` statement. +These are the same [options](https://duckdb.org/docs/data/csv/overview#parameters) accepted +by DuckDB's `read_csv` function. + + +Option to skip type detection for CSV parsing and assume all columns to be of type`VARCHAR`. + + + +Option to allow the conversion of quoted values to `NULL` values. + + + +Enables auto detection of CSV parameters. See [Auto Detection](https://duckdb.org/docs/data/csv/auto_detection.html). + + + +This option allows you to specify the types that the sniffer will use when detecting CSV column types. +The `VARCHAR` type is always included in the detected types (as a fallback option). +See [Auto Type Candidates](https://duckdb.org/docs/data/csv/overview#auto_type_candidates-details). + + +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + auto_type_candidates 'BIGINT, DATE' +); +```` + + + + + +A struct that specifies the column names and column types contained within the CSV file +(e.g., `{'col1': 'INTEGER', 'col2': 'VARCHAR'}`). Using this option implies that auto detection is +not used. + + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + columns $${'FlightDate': 'DATE', 'UniqueCarrier': 'VARCHAR'}$$ +); +``` + + + + + The compression type for the file. By default this will be detected + automatically from the file extension (e.g., `t.csv.gz` will use `gzip`, + `t.csv` will use `none`). Options are `none`, `gzip`, `zstd`. + + + + Specifies the date format to use when parsing dates. See [Date + Format](https://duckdb.org/docs/sql/functions/dateformat.html). + + + + The decimal separator of numbers. + + + + Specifies the delimiter character that separates columns within each row + (line) of the file. Alias for `sep`. + + + + Specifies the string that should appear before a data character sequence that + matches the quote value. + + + + Whether or not an extra filename column should be included in the result. + + + +Do not match the specified columns' values against the `NULL` string. In the default case where the `NULL` string is empty, this means that empty values will be read as zero-length strings rather than NULLs. 
+ + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + force_not_null 'FlightDate, UniqueCarrier' +); +``` + + + + + Specifies that the file contains a header line with the names of each column + in the file. + + + + Whether or not to interpret the path as a Hive partitioned path. + + + +If `hive_partitioning` is enabled, `hive_types` can be used to specify the logical types of the hive +partitions in a struct. + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + hive_partitioning 'true', + hive_types $${'release': DATE, 'orders': BIGINT}$$ +); +``` + + + +hive_types will be autodetected for the following types: `DATE`, `TIMESTAMP` and `BIGINT`. +To switch off the autodetection, this option can be set to `0`. + +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + hive_partitioning 'true', + hive_types $${'release': DATE, 'orders': BIGINT}$$, + hive_types_autocast '0' +); +``` + + + + + Option to ignore any parsing errors encountered and instead ignore rows with + errors. + + + + The maximum line size in bytes. + + + +The column names as a list if the file does not contain a header. + + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + names 'FlightDate, UniqueCarrier' +); +``` + + + + + Set the new line character(s) in the file. Options are '\r','\n', or '\r\n'. + + + + Boolean value that specifies whether or not column names should be normalized, + removing any non-alphanumeric characters from them. + + + + If this option is enabled, when a row lacks columns, it will pad the remaining + columns on the right with null values. + + + +Specifies the string that represents a `NULL` value or a list of strings that represent a `NULL` value. + + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + nullstr 'NULL, NONE' +); +``` + + + + + Whether or not the parallel CSV reader is used. + + + + Specifies the quoting string to be used when a data value is quoted. + + + + The number of sample rows for auto detection of parameters. + + + + Specifies the delimiter character that separates columns within each row + (line) of the file. Alias for `delim`. + + + + The number of lines at the top of the file to skip. + + + + Specifies the date format to use when parsing timestamps. See [Date + Format](https://duckdb.org/docs/sql/functions/dateformat.html). + + + +The column types as a list by position. + + +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files 's3://bucket/folder/file.csv', + types 'BIGINT, DATE' +); +``` + + + + + Whether the columns of multiple schemas should be unified by name, rather than + by position. + + +## Multiple CSV Files + +To treat multiple CSV files as a single table, their paths should be passed in as a comma-separated +string. + +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files '/path/to/file1.csv, /path/to/file2.csv' +); +``` + +To treat a directory of CSV files as a single table, the glob pattern should be used. 
+ +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files '/folder/*.csv', +); +``` + +The glob pattern can also be used to read all CSV files from multiple directories. + +```sql +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS ( + files '/folder1/*.csv, /folder2/*.csv' +); +``` + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +CSV files stored in object stores like S3. diff --git a/docs/formats/delta.mdx b/docs/formats/delta.mdx new file mode 100644 index 00000000..afd103bd --- /dev/null +++ b/docs/formats/delta.mdx @@ -0,0 +1,56 @@ +--- +title: Delta Lake +--- + +## Overview + +This code block demonstrates how to query a Delta table. + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER delta_fdw_handler +VALIDATOR delta_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER delta_wrapper +HANDLER delta_fdw_handler +VALIDATOR delta_fdw_validator; + +CREATE SERVER delta_server +FOREIGN DATA WRAPPER delta_wrapper; + +CREATE FOREIGN TABLE delta_table () +SERVER delta_server +OPTIONS (files 's3://bucket/folder'); + +``` + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. + + +The path to the Delta table directory. For instance, `s3://bucket/folder` if the Delta table is in Amazon S3 or +`/path/to/folder` if the Delta table is on the local file system. + + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +Delta tables stored in object stores like S3. +``` diff --git a/docs/formats/iceberg.mdx b/docs/formats/iceberg.mdx new file mode 100644 index 00000000..c417c840 --- /dev/null +++ b/docs/formats/iceberg.mdx @@ -0,0 +1,86 @@ +--- +title: Iceberg +--- + +## Overview + +This code block demonstrates how to query an Iceberg table. + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER iceberg_fdw_handler +VALIDATOR iceberg_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER iceberg_wrapper +HANDLER iceberg_fdw_handler +VALIDATOR iceberg_fdw_validator; + +CREATE SERVER iceberg_server +FOREIGN DATA WRAPPER iceberg_wrapper; + +CREATE FOREIGN TABLE iceberg_table () +SERVER iceberg_server +OPTIONS (files 's3://bucket/folder'); + +```` + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. + + +The path to the Iceberg table. For instance, `s3://bucket/folder` if the Iceberg table is in Amazon S3 or +`/path/to/folder` if the Iceberg table is on the local file system. + + +## Allow Moved Paths + +The `allow_moved_paths` option ensures that some path resolution is performed, which allows scanning Iceberg tables that are moved. 
+ +```sql +CREATE FOREIGN TABLE iceberg_table () +SERVER iceberg_server +OPTIONS ( + files 's3://bucket/folder', + allow_moved_paths 'true' +); +```` + +## Linking to the Manifest File + +If no `version-hint.text` file is found in the Iceberg metadata, the following error will be thrown: + +``` +Error: IO Error: Cannot open file "s3://⟨bucket⟩/⟨iceberg-table-folder⟩/metadata/version-hint.text": No such file or directory +``` + +Providing the path to the `.metadata.json` manifest will circumvent this error. + +```sql +CREATE FOREIGN TABLE iceberg_table () +SERVER iceberg_server +OPTIONS ( + files 's3://⟨bucket⟩/⟨iceberg-table-folder⟩/metadata/⟨id⟩.metadata.json', +); +``` + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +Iceberg tables stored in object stores like S3. diff --git a/docs/formats/json.mdx b/docs/formats/json.mdx new file mode 100644 index 00000000..355327af --- /dev/null +++ b/docs/formats/json.mdx @@ -0,0 +1,356 @@ +--- +title: JSON +--- + +## Overview + +This code block demonstrates how to query JSON file(s). + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER json_fdw_handler +VALIDATOR json_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER json_wrapper +HANDLER json_fdw_handler +VALIDATOR json_fdw_validator; + +CREATE SERVER json_server +FOREIGN DATA WRAPPER json_wrapper; + +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS (files 's3://bucket/folder/file.json'); + +```` + + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. + + +The path of a single JSON file or [multiple JSON files](#multiple-json-files). +For instance, `s3://bucket/folder/file.json` if the file is in Amazon S3 or `/path/to/file.json` +if the file is on the local file system. + + +## JSON Options + +There are a number of options that can be passed into the `CREATE FOREIGN TABLE` statement. +These are the same [options](https://duckdb.org/docs/data/json/overview#parameters) accepted +by DuckDB's `read_json` function. + + + +Enables auto detection of key names and value types. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + auto_detect 'true' +); +```` + + + + + + +Specifies key names and value types in the JSON file (e.g. `{key1: 'INTEGER', key2: 'VARCHAR'}`). If `auto_detect` is enabled the value of this setting will be inferred from the JSON file contents. + + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + columns $${key1: 'INTEGER', key2: 'VARCHAR'}$$ +); +``` + + + + +The compression type for the file. By default this will be detected automatically from the file extension (e.g., `t.json.gz` will use `gzip`, `t.json` will use `none`). Options are `uncompressed`, `gzip`, `zstd`, and `auto_detect`. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + compression 'gzip' +); +``` + + + + +Whether strings representing integer values should be converted to a numerical type. 
+ + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + convert_strings_to_integers 'true' +); +``` + + + + +Specifies the date format to use when parsing dates. See [Date Format](https://duckdb.org/docs/sql/functions/dateformat.html) + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + dateformat '%d/%m/%Y' +); +``` + + + + +Whether or not an extra filename column should be included in the result. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + filename 'false' +); +``` + + + + +Can be one of `auto`, `unstructured`, `newline_delimited` and `array` + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + format 'unstructured' +); +``` + + + + +Whether or not to interpret the path as a Hive partitioned path. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + hive_partitioning 'true' +); +``` + + + + +Whether to ignore parse errors (only possible when format is `newline_delimited`) + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + ignore_errors 'false' +); +``` + + + + +Maximum nesting depth to which the automatic schema detection detects types. Set to `-1` to fully detect nested JSON types. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + maximum_depth '65536' +); +``` + + + + +The maximum size of a JSON object (in bytes). + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + maximum_object_size '65536' +); +``` + + + + +Determines whether the fields of JSON object will be unpacked into individual columns. +Can be one of `auto`, `true` or `false` + +Suppose we have a JSON file with these contents: + +```json +{"key1":"value1", "key2": "value1"} +{"key1":"value2", "key2": "value2"} +{"key1":"value3", "key2": "value3"} +``` + +Reading it with `records` set to `true` will result in these table contents: + +```csv + key1 | key2 +-----------------+ + value1 | value1 + value2 | value2 + value3 | value3 +``` + +Reading it with `records` set to `false` will result in these table contents: + +```csv + json +---------------------------------+ + {'key1': value1, 'key2': value1} + {'key1': value2, 'key2': value2} + {'key1': value3, 'key2': value3} +``` + +If set to `auto` DuckDB will try to determine the desired behaviour. See [DuckDB documentation](https://duckdb.org/docs/data/json/overview#examples-of-records-settings) for more details. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + records 'auto' +); +``` + + + + +Option to define number of sample objects for automatic JSON type detection. Set to `-1` to scan the entire input file + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + sample_size '4086' +); +``` + + + + +Specifies the date format to use when parsing timestamps. 
See [Date Format](https://duckdb.org/docs/sql/functions/dateformat.html) + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + timestampformat 'iso' +); +``` + + + + +Whether the schema's of multiple JSON files should be unified. + + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files 's3://bucket/folder/file.json', + union_by_name 'false' +); +``` + + + +## Multiple JSON Files + +To treat multiple JSON files as a single table, their paths should be passed in as a comma-separated +string. + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files '/path/to/file1.json, /path/to/file2.json' +); +``` + +To treat a directory of JSON files as a single table, the glob pattern should be used. + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files '/folder/*.json', +); +``` + +The glob pattern can also be used to read all JSON files from multiple directories. + +```sql +CREATE FOREIGN TABLE json_table () +SERVER json_server +OPTIONS ( + files '/folder1/*.json, /folder2/*.json' +); +``` + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +JSON files stored in object stores like S3. diff --git a/docs/formats/parquet.mdx b/docs/formats/parquet.mdx new file mode 100644 index 00000000..ae7cea76 --- /dev/null +++ b/docs/formats/parquet.mdx @@ -0,0 +1,175 @@ +--- +title: Parquet +--- + +## Overview + +This code block demonstrates how to query Parquet file(s). + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER parquet_fdw_handler +VALIDATOR parquet_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER parquet_wrapper +HANDLER parquet_fdw_handler +VALIDATOR parquet_fdw_validator; + +CREATE SERVER parquet_server +FOREIGN DATA WRAPPER parquet_wrapper; + +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS (files 's3://bucket/folder/file.parquet'); + +```` + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. + + +The path of a single Parquet file or [multiple Parquet files](#multiple-parquet-files). +For instance, `s3://bucket/folder/file.parquet` if the file is in Amazon S3, `https://domain.tld/file.parquet` +if the file is on a HTTP server, or `/path/to/file.parquet` if the file is on the local file system. + + +## Parquet Options + +There are a number of options that can be passed into the `CREATE FOREIGN TABLE` statement. +These are the same [options](https://duckdb.org/docs/data/parquet/overview#parameters) accepted +by DuckDB's `read_parquet` function. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files 's3://bucket/folder/file.parquet', + binary_as_string 'true', + hive_partitioning 'true' +); +```` + + +The path of a single Parquet file or [multiple Parquet files](#multiple-parquet-files). +For instance, `s3://bucket/folder/file.parquet` if the file is in Amazon S3 or `/path/to/file.parquet` +if the file is on the local file system. + + +Parquet files generated by legacy writers do not correctly set the `UTF8` flag for strings, +causing string columns to be loaded as `BLOB` instead. Set this to true to load binary columns as +strings. 
+ + +Whether or not an extra `filename` column should be included in the result. + + +Whether or not to include the `file_row_number` column. + + +Whether or not to interpret the path as a Hive partitioned path. + + +If `hive_partitioning` is enabled, `hive_types` can be used to specify the logical types of the hive +partitions in a struct. + +```sql +-- Dollar-quoted strings are used to contain single quotes +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files 's3://bucket/folder/file.parquet', + hive_partitioning 'true', + hive_types $${'release': DATE, 'orders': BIGINT}$$ +); +``` + + + +hive_types will be autodetected for the following types: `DATE`, `TIMESTAMP` and `BIGINT`. +To switch off the autodetection, this option can be set to `0`. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files 's3://bucket/folder/file.parquet', + hive_partitioning 'true', + hive_types $${'release': DATE, 'orders': BIGINT}$$, + hive_types_autocast '0' +); +``` + + + +Whether the columns of multiple schemas should be unified by name, rather than by position. + + +## Multiple Parquet Files + +To treat multiple Parquet files as a single table, their paths should be passed in as a comma-separated +string. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files '/path/to/file1.parquet, /path/to/file2.parquet' +); +``` + +To treat a directory of Parquet files as a single table, the glob pattern should be used. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files '/folder/*.parquet', +); +``` + +The glob pattern can also be used to read all Parquet files from multiple directories. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files '/folder1/*.parquet, /folder2/*.parquet' +); +``` + +## Parquet Schema + +The `parquet_describe` function returns the column names and types contained within a Parquet file. This function is useful +for determining the schema of the Postgres foreign table. + +```sql +SELECT * FROM parquet_describe('/path/to/file.parquet') +``` + +The `parquet_schema` function returns the internal schema contained within the metadata of a Parquet file. + +```sql +SELECT * FROM parquet_schema('/path/to/file.parquet'); +``` + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +Parquet files stored in object stores like S3. diff --git a/docs/formats/spatial.mdx b/docs/formats/spatial.mdx new file mode 100644 index 00000000..e31ff996 --- /dev/null +++ b/docs/formats/spatial.mdx @@ -0,0 +1,116 @@ +--- +title: Geospatial +--- + +## Overview + +This code block demonstrates how to query a geospatial file. The supported file types are `.geojson` and `.xlsx`. + +```sql +CREATE FOREIGN DATA WRAPPER +HANDLER spatial_fdw_handler +VALIDATOR spatial_fdw_validator; + +CREATE SERVER +FOREIGN DATA WRAPPER ; + +CREATE FOREIGN TABLE () +SERVER +OPTIONS (files ''); +``` + + +```sql +CREATE FOREIGN DATA WRAPPER spatial_wrapper +HANDLER spatial_fdw_handler +VALIDATOR spatial_fdw_validator; + +CREATE SERVER spatial_server +FOREIGN DATA WRAPPER spatial_wrapper; + +CREATE FOREIGN TABLE spatial_table () +SERVER spatial_server +OPTIONS (files 's3://bucket/folder/file.geojson'); + +```` + + + + Foreign data wrapper name. Can be any string. + + + Foreign server name. Can be any string. + + + Foreign table name. Can be any string. 
+ + +For instance, `s3://bucket/folder/file.geojson` if the file is in Amazon S3, `https://domain.tld/file.geojson` +if the file is on a HTTP server, or `/path/to/file.geojson` if the file is on the local file system. + + +## Geospatial Options + +There are a number of options that can be passed into the `CREATE FOREIGN TABLE` statement. +These are the same [options](https://duckdb.org/docs/extensions/spatial#st_read--read-spatial-data-from-files) accepted +by DuckDB's `st_read` function in the `spatial` extension. + +```sql +CREATE FOREIGN TABLE spatial_table () +SERVER spatial_server +OPTIONS ( + files 's3://bucket/folder/file.geojson', + layer 'layer_name' +); +``` + + + The path of a single geospatial file. For instance, + `s3://bucket/folder/file.geojson` if the file is in Amazon S3 or + `/path/to/file.geojson` if the file is on the local file system. + + + If set to `true`, the table function will scan through all layers sequentially + and return the first `layer` that matches the given layer name. This is + required for some drivers to work properly, e.g., the `OSM` driver. + + + If set to a WKB blob, the table function will only return rows that intersect + with the given WKB geometry. Some drivers may support efficient spatial + filtering natively, in which case it will be pushed down. Otherwise the + filtering is done by GDAL which may be much slower. + + + A list of key-value pairs that are passed to the GDAL driver to control the + opening of the file. E.g., the `GeoJSON` driver supports a + `FLATTEN_NESTED_ATTRIBUTES=YES` option to flatten nested attributes. + + + The name of the layer to read from the file. If `NULL`, the first layer is + returned. Can also be a layer index (starting at 0). + + + A list of GDAL driver names that are allowed to be used to open the file. If + empty, all drivers are allowed. + + + A list of sibling files that are required to open the file. E.g., the `ESRI + Shapefile` driver requires a `.shx` file to be present. Although most of the + time these can be discovered automatically. + + + If set to a BOX_2D, the table function will only return rows that intersect + with the given bounding box. Similar to `spatial_filter`. + + + If set, the table function will return geometries in a `wkb_geometry` column + with the type `WKB_BLOB` (which can be cast to `BLOB`) instead of `GEOMETRY`. + This is useful if you want to use DuckDB with more exotic geometry subtypes + that DuckDB spatial doesn't support representing in the `GEOMETRY` type yet. + + +## Cloud Object Stores + +The [object stores](/integrations/object_stores) documentation explains how to provide secrets and other credentials for +geospatial files stored in object stores like S3. +```` diff --git a/docs/object_stores/azure.mdx b/docs/object_stores/azure.mdx new file mode 100644 index 00000000..c4833e2b --- /dev/null +++ b/docs/object_stores/azure.mdx @@ -0,0 +1,168 @@ +--- +title: Azure +--- + +## Overview + +This code block demonstrates how to query Parquet file(s) stored in Azure. The file path must start with an Azure +scheme such as `az`, `azure`, or `abfss`. + +```sql +-- Parquet format is assumed +CREATE FOREIGN DATA WRAPPER parquet_wrapper +HANDLER parquet_fdw_handler +VALIDATOR parquet_fdw_validator; + +CREATE SERVER parquet_server +FOREIGN DATA WRAPPER parquet_wrapper; + +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS (files 'az:////.parquet'); +``` + +The glob pattern can be used to query a directory of files. 
+ +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS (files 'az:////*.parquet'); +``` + +Fully-qualified path syntax is also supported. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS ( + files 'az://.blob.core.windows.net///*.parquet' +); +``` + +## Providing Credentials + +`CREATE USER MAPPING` is used to provide Azure credentials. These credentials are tied to a specific Postgres user, which enables +multiple users to query the same foreign table with their own credentials. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + connection_string '' +); +``` + + + The name of the Postgres user. If set to `public`, these credentials will be + applied to all users. `SELECT current_user` can be used to get the name of the + current Postgres user. + + + Foreign server name. + + +There are several ways to authenticate with Azure: via a connection string, the Azure credential chain, or an Azure Service Principal. + +## Connection String + +The following code block demonstrates how to use a connection string. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + connection_string '' +); +``` + +If authentication is not used, a storage account name must be provided. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + account_name '' +); +``` + +## Credential Chain + +The `CREDENTIAL CHAIN` provider allows connecting using credentials automatically fetched by the Azure SDK via the Azure credential chain. By default, +the `DefaultAzureCredential` chain used, which tries credentials according to the order specified by the +[Azure documentation](https://learn.microsoft.com/en-us/javascript/api/@azure/identity/defaultazurecredential?view=azure-node-latest#@azure-identity-defaultazurecredential-constructor). + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + provider 'CREDENTIAL_CHAIN', + account_name '' +); +``` + +The `chain` option can be used to specify a specific chain. This takes a semicolon-separated list of providers that will be tried in order. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + provider 'CREDENTIAL_CHAIN', + chain 'cli;env', + account_name '' +); +``` + +The available chains are `cli`, `env`, `managed_identity`, and `default`. + +## Service Principal + +The service principal provider allows connecting using a [Azure Service Principal (SPN)](https://learn.microsoft.com/en-us/entra/architecture/service-accounts-principal). + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + provider 'SERVICE_PRINCIPAL', + tenant_id '', + client_id '', + client_secret '', + account_name '' +); +``` + +If a certificate is present on the same Postgres instance, it can also be used. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + provider 'SERVICE_PRINCIPAL', + tenant_id '', + client_id '', + client_certificate_path '', + account_name '' +); +``` + +## Configuring a Proxy + +The following code block demonstrates how to configure proxy information. 
+ +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'AZURE', + connection_string '', + http_proxy 'http://localhost:3128', + proxy_user_name 'john', + proxy_password 'doe' +); +``` diff --git a/docs/object_stores/gcs.mdx b/docs/object_stores/gcs.mdx new file mode 100644 index 00000000..30d5d74b --- /dev/null +++ b/docs/object_stores/gcs.mdx @@ -0,0 +1,50 @@ +--- +title: Google Cloud Storage +--- + +## Overview + +This code block demonstrates how to query Parquet file(s) stored in Google Cloud Storage (GCS). +The file path must start with `gs://`. + +```sql +-- Parquet format is assumed +CREATE FOREIGN DATA WRAPPER parquet_wrapper +HANDLER parquet_fdw_handler +VALIDATOR parquet_fdw_validator; + +CREATE SERVER parquet_server +FOREIGN DATA WRAPPER parquet_wrapper; + +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS (files 'gs:////.parquet'); +``` + +The glob pattern can be used to query a directory of files. + +```sql +CREATE FOREIGN TABLE parquet_table () +SERVER parquet_server +OPTIONS (files 'gs:////*.parquet'); +``` + +## Providing Credentials + +`CREATE USER MAPPING` is used to provide GCS credentials. These credentials are tied to a specific Postgres user, which enables +multiple users to query the same foreign table with their own credentials. + +[HMAC keys](https://console.cloud.google.com/storage/settings;tab=interoperability) are used for authentication. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'GCS', + key_id '', + secret '' +); +``` + +Because GCS is accessed with the S3 API, GCS accepts the same user mapping options as S3. +Please see the [S3 documentation](/integrations/object_stores/s3#credentials-options) for other available options. diff --git a/docs/object_stores/huggingface.mdx b/docs/object_stores/huggingface.mdx new file mode 100644 index 00000000..e15291e0 --- /dev/null +++ b/docs/object_stores/huggingface.mdx @@ -0,0 +1,63 @@ +--- +title: Hugging Face +--- + +## Overview + +This code block demonstrates how to query machine learning datasets from the Hugging Face Datasets library. The file path must start with `hf://`. + +```sql +-- CSV format is assumed +CREATE FOREIGN DATA WRAPPER csv_wrapper +HANDLER csv_fdw_handler +VALIDATOR csv_fdw_validator; + +CREATE SERVER csv_server +FOREIGN DATA WRAPPER csv_wrapper; + +CREATE FOREIGN TABLE csv_table () +SERVER csv_server +OPTIONS (files 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv'); +``` + +## Providing Credentials + +`CREATE USER MAPPING` is used to provide Hugging Face credentials. These credentials are tied to a specific Postgres user, which enables +multiple users to query the same foreign table with their own credentials. + +```sql +CREATE USER MAPPING FOR +SERVER +OPTIONS ( + type 'HUGGINGFACE', + token '' +); +``` + + + The name of the Postgres user. If set to `public`, these credentials will be + applied to all users. `SELECT current_user` can be used to get the name of the + current Postgres user. + + + Foreign server name. + + +## Credentials Options + +The following options can be passed into `CREATE USER MAPPING`: + +Your Hugging Face token. + +## Credential Chain Provider + +The `CREDENTIAL_CHAIN` provider allows connecting using credentials automatically fetched from `~/.cache/huggingface/token`. 

```sql
CREATE USER MAPPING FOR
SERVER
OPTIONS (
  type 'HUGGINGFACE',
  provider 'CREDENTIAL_CHAIN'
);
```

diff --git a/docs/object_stores/s3.mdx b/docs/object_stores/s3.mdx
new file mode 100644
index 00000000..30418c1b
--- /dev/null
+++ b/docs/object_stores/s3.mdx
@@ -0,0 +1,103 @@
---
title: S3
---

## Overview

S3 foreign tables have been tested against the following object stores that implement the S3 API: Amazon S3, MinIO, Cloudflare R2, and
Google Cloud.

This code block demonstrates how to query Parquet file(s) stored in S3. The file path must start with `s3`, `r2`, or `gs`.

```sql
-- Parquet format is assumed
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler
VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server
FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE parquet_table ()
SERVER parquet_server
OPTIONS (files 's3:////.parquet');
```

The glob pattern can be used to query a directory of files.

```sql
CREATE FOREIGN TABLE parquet_table ()
SERVER parquet_server
OPTIONS (files 's3:////*.parquet');
```

## Providing Credentials

`CREATE USER MAPPING` is used to provide S3 credentials. These credentials are tied to a specific Postgres user, which enables
multiple users to query the same foreign table with their own credentials.

```sql
CREATE USER MAPPING FOR
SERVER
OPTIONS (
  type 'S3',
  key_id '',
  secret '',
  region 'us-east-1'
);
```

The name of the Postgres user. If set to `public`, these credentials will be
applied to all users. `SELECT current_user` can be used to get the name of the
current Postgres user.

Foreign server name.

## Credentials Options

The following options can be passed into `CREATE USER MAPPING`:

Must be one of `S3`, `GCS`, or `R2`.

The secret key.

The region for which to authenticate (should match the region of the bucket to
query).

A session token.

Specify a custom S3 endpoint.

Either `vhost` or `path`. The default for S3 is `vhost` and the default for R2
and GCS is `path`.

Whether to use HTTPS or HTTP.

Can help when URLs contain problematic characters.

The R2 account ID to use for generating the endpoint URL.

## Credential Chain Provider

Providing credentials via `key_id` and `secret` requires permanent AWS IAM/Identity Center keys. The `CREDENTIAL_CHAIN` provider can
automatically fetch ephemeral credentials using mechanisms provided by the AWS SDK.

```sql
CREATE USER MAPPING FOR
SERVER
OPTIONS (
  type 'S3',
  provider 'CREDENTIAL_CHAIN',
  CHAIN 'env;config;instance'
);
```

The following values can be passed into `CHAIN`: `config`, `sts`, `sso`, `env`, `instance`, `process`.

diff --git a/docs/overview.mdx b/docs/overview.mdx
new file mode 100644
index 00000000..eced5995
--- /dev/null
+++ b/docs/overview.mdx
@@ -0,0 +1,61 @@
---
title: Overview
---

Today, a vast amount of data is stored in

1. File formats like Parquet or CSV
2. Data lakes like S3 or GCS
3. Table formats like Delta Lake or Iceberg

ParadeDB's integrations make it easy to ingest this data without data processing engines or ETL tools, which can be complex and error-prone.

## Basic Usage

In this example, we will query and copy a Parquet file stored in S3 to Postgres. The Parquet file
contains 3 million NYC taxi trips from January 2024, hosted in a public S3 bucket provided by ParadeDB.
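
If you are following along on your own Postgres instance, the extension should be enabled before creating any foreign tables. A minimal sketch, assuming `pg_analytics` is already installed:

```sql
-- Enable the extension (no-op if it is already enabled)
CREATE EXTENSION IF NOT EXISTS pg_analytics;
```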

To begin, let's create a [Postgres foreign data wrapper](https://wiki.postgresql.org/wiki/Foreign_data_wrappers), which is how ParadeDB connects to S3.

```sql
CREATE FOREIGN DATA WRAPPER parquet_wrapper
HANDLER parquet_fdw_handler VALIDATOR parquet_fdw_validator;

CREATE SERVER parquet_server FOREIGN DATA WRAPPER parquet_wrapper;

CREATE FOREIGN TABLE trips ()
SERVER parquet_server
OPTIONS (files 's3://paradedb-benchmarks/yellow_tripdata_2024-01.parquet');
```

Next, let's query the foreign table `trips`. You'll notice that the column names and types of this table are automatically
inferred from the Parquet file.

```sql
SELECT vendorid, passenger_count, trip_distance FROM trips LIMIT 1;
```

```csv
 vendorid | passenger_count | trip_distance
----------+-----------------+---------------
        2 |               1 |          1.72
(1 row)
```

Queries over this table are powered by [DuckDB](https://duckdb.org), an in-process analytical query engine.
This means that you can run fast analytical queries over data lakes from ParadeDB.

```sql
SELECT COUNT(*) FROM trips;
```

```csv
  count
---------
 2964624
(1 row)
```

Finally, let's copy this table into a Postgres heap table. For demonstration, we will
copy over the first 100 rows.

```sql
CREATE TABLE trips_copy AS SELECT * FROM trips LIMIT 100;
```

That's it! Please refer to the other sections for instructions on how to ingest from other [file and table formats](/integrations/formats) and [object stores](/integrations/object_stores).

diff --git a/tests/README.md b/tests/README.md
index d849f9bb..d0890cae 100644
--- a/tests/README.md
+++ b/tests/README.md
@@ -10,9 +10,12 @@ An example of doing all that's necessary to run the tests is, from the root of t
 set -x
 export DATABASE_URL=postgresql://localhost:28816/pg_analytics
 export RUST_BACKTRACE=1
+
+# Reload pg_analytics
 cargo pgrx stop --package pg_analytics
 cargo pgrx install --package pg_analytics --pg-config ~/.pgrx/16.4/pgrx-install/bin/pg_config
 cargo pgrx start --package pg_analytics
+# Run tests
 cargo test --package tests --features pg16
 ```

From 937133b27b3b53c9f268da65952942e6bbdfabc4 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Philippe=20No=C3=ABl?=
Date: Mon, 28 Oct 2024 23:58:25 +0000
Subject: [PATCH 3/4] Update badge

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 406091f6..292a3d7a 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@
 [![Publish pg_analytics](https://github.com/paradedb/pg_analytics/actions/workflows/publish-pg_analytics.yml/badge.svg)](https://github.com/paradedb/pg_analytics/actions/workflows/publish-pg_analytics.yml)
 [![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/paradedb)](https://artifacthub.io/packages/search?repo=paradedb)
 [![Docker Pulls](https://img.shields.io/docker/pulls/paradedb/paradedb)](https://hub.docker.com/r/paradedb/paradedb)
-[![License](https://img.shields.io/github/license/paradedb/paradedb?color=blue)](https://github.com/paradedb/pg_analytics?tab=PostgreSQL-1-ov-file#readme)
+[![License](https://img.shields.io/badge/License-PostgreSQL-blue)](https://github.com/paradedb/pg_analytics?tab=PostgreSQL-1-ov-file#readme)
 [![Slack URL](https://img.shields.io/badge/Join%20Slack-purple?logo=slack&link=https%3A%2F%2Fjoin.slack.com%2Ft%2Fparadedbcommunity%2Fshared_invite%2Fzt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ)](https://join.slack.com/t/paradedbcommunity/shared_invite/zt-2lkzdsetw-OiIgbyFeiibd1DG~6wFgTQ)
 [![X 
URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Ftwitter.com%2Fparadedb&label=Follow%20%40paradedb)](https://x.com/paradedb) From d326c0ade524d33facb65257025668302cc1ed52 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Philippe=20No=C3=ABl?= Date: Mon, 28 Oct 2024 23:59:59 +0000 Subject: [PATCH 4/4] Docs --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index 292a3d7a..22024d16 100644 --- a/README.md +++ b/README.md @@ -121,6 +121,8 @@ SELECT COUNT(*) FROM trips; Complete documentation for `pg_analytics` can be found under the [docs](/docs/) folder as Markdown files. It covers how to query the various object stores and file and table formats supports, and how to configure and tune the extension. +A hosted version of the documentation can be found [here](https://docs.paradedb.com/integrations/overview). + ## Development ### Install Rust