Docs: Add rewrite-table-path in spark procedure #12115

84 changes: 83 additions & 1 deletion docs/docs/spark-procedures.md
Collect statistics of the snapshot with id `snap1` of table `my_table` for columns `col1` and `col2`
```sql
CALL catalog_name.system.compute_table_stats(table => 'my_table', snapshot_id => 'snap1', columns => array('col1', 'col2'));
```

## Table Replication

### `rewrite_table_path`

This procedure rewrites an Iceberg table's metadata files to a new location by replacing all source prefixes in absolute paths
with a specified target prefix. After both metadata and data files are copied to the desired location, the replicated Iceberg
table will appear identical to the source table, including snapshot history, schema, and partition specs.

!!! info
    This procedure serves as the starting point for fully or incrementally copying an Iceberg table to a new location. Copying all
    metadata and data files from the source to the target location is not included as part of this procedure.

| Argument Name      | Required? | Type   | Description                                                                                                    |
|--------------------|-----------|--------|----------------------------------------------------------------------------------------------------------------|
| `table`            | ✔️        | string | Name of the table                                                                                              |
| `source_prefix`    | ✔️        | string | Source prefix to be replaced                                                                                   |
| `target_prefix`    | ✔️        | string | Target prefix that replaces the source prefix                                                                  |
| `start_version`    |           | string | First metadata version to rewrite, identified by the name of a metadata.json file in the table's metadata log  |
| `end_version`      |           | string | Last metadata version to rewrite, identified by the name of a metadata.json file in the table's metadata log   |
| `staging_location` |           | string | Custom staging location for the rewritten metadata files                                                       |

#### Output

| Output Name          | Type   | Description                                                                                 |
|----------------------|--------|----------------------------------------------------------------------------------------------|
| `latest_version`     | string | Name of the latest metadata file version                                                    |
| `file_list_location` | string | Path to a file listing comma-separated pairs of source and target paths ready to be copied  |

Example file list content:

```csv
sourcepath/datafile1.parquet,targetpath/datafile1.parquet
sourcepath/datafile2.parquet,targetpath/datafile2.parquet
stagingpath/manifest.avro,targetpath/manifest.avro
```

#### Examples

Full rewrite of the metadata paths of table `my_table` from a source location in HDFS to a target location in an S3 bucket:

```sql
CALL catalog_name.system.rewrite_table_path(
    table => 'db.my_table',
    source_prefix => 'hdfs://nn:8020/path/to/source_table',
    target_prefix => 's3a://bucket/prefix/db.db/my_table'
);
```

Incremental rewrite of a table's metadata path from a source location to a target location between metadata versions
`v2.metadata.json` and `v3.metadata.json`, with new metadata files written to a staging location:

```sql
CALL catalog_name.system.rewrite_table_path(
    table => 'db.my_table',
    source_prefix => 's3a://bucketOne/prefix/db.db/my_table',
    target_prefix => 's3a://bucketTwo/prefix/db.db/my_table',
    start_version => 'v2.metadata.json',
    end_version => 'v3.metadata.json',
    staging_location => 's3a://bucketStaging/my_table'
);
```

Once the rewrite is completed, third-party tools
(e.g. [DistCp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html)) can be used to copy the newly created
metadata files and data files to the target location. Here is an example of reading the file list from the file list location in
Spark:

```java
// Each line of the file list holds a comma-separated source and target path
List<String> filesToMove =
    spark
        .read()
        .format("text")
        .load(result.fileListLocation())
        .as(Encoders.STRING())
        .collectAsList();
```
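
As one possible way to perform the copy directly from Spark, the sketch below iterates over the collected pairs and copies each file with the Hadoop `FileSystem` API (`org.apache.hadoop.fs`). It is illustrative only and assumes the `spark` session and `filesToMove` list from the previous snippet; for large tables a bulk tool such as DistCp is usually preferable.

```java
// Illustrative sketch only: copy each "source,target" pair with the Hadoop FileSystem API.
// Assumes the `spark` session and `filesToMove` list from the snippet above, plus
// org.apache.hadoop.conf.Configuration and org.apache.hadoop.fs.Path/FileSystem/FileUtil.
Configuration conf = spark.sparkContext().hadoopConfiguration();
for (String pair : filesToMove) {
  String[] paths = pair.split(",");
  Path source = new Path(paths[0]);
  Path target = new Path(paths[1]);
  // Resolve a FileSystem per path, so source and target can live on different storage systems
  FileSystem sourceFs = source.getFileSystem(conf);
  FileSystem targetFs = target.getFileSystem(conf);
  // Copy without deleting the source; overwrite the target if it already exists
  FileUtil.copy(sourceFs, source, targetFs, target, false, true, conf);
}
```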

Lastly, the [register_table](#register_table) procedure can be used to register the copied table in the target location with a catalog.
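
For illustration, assuming the copy mirrored the table's `metadata/` directory under the target prefix and `latest_version` returned `v3.metadata.json`, the registration could look like the following sketch. The table identifier and the exact metadata file path shown here are placeholders, not output of the procedure.

```java
// Illustrative only: register the copied table with the catalog.
// The metadata file name comes from the `latest_version` output of rewrite_table_path;
// the table identifier and target path shown here are placeholders.
spark.sql(
    "CALL catalog_name.system.register_table("
        + "table => 'db.my_table_copy', "
        + "metadata_file => 's3a://bucket/prefix/db.db/my_table/metadata/v3.metadata.json')");
```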

!!! warning
    Iceberg tables with statistics files are not currently supported.