FluentBit | Web Analytics | PostgreSQL CDC | REST API | OpenSearch/ES | AWS Lambda Telemetry
"The same data costs 70-100x more on a highly available (HA) OpenSearch cluster with EBS Volumes vs. S3 with compressed Parquet files!*"
This multi-threaded Node.js application (with a configurable number of workers) uses the OpenSearch (Elasticsearch) sliced scroll API to efficiently dump an index to S3 via a Data Tap.
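To illustrate the idea (a simplified sketch, not the actual src/index.js code), each worker can open its own slice of the scroll and page through it independently. The index name, credentials, page size, and plain-HTTP endpoint below are assumptions taken from the local example setup:

```javascript
// Simplified sliced-scroll sketch (illustrative only, not the actual src/index.js).
// Each worker reads one slice of the index independently of the others.
const ES = "http://localhost:9200"; // adjust protocol/TLS and auth to your cluster
const AUTH = "Basic " + Buffer.from("admin:Admin123__kjljklkjl---").toString("base64");

async function readSlice(index, sliceId, maxSlices, onHits) {
  // Open a scroll that covers only this worker's slice of the index.
  let res = await fetch(`${ES}/${index}/_search?scroll=1m`, {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: AUTH },
    body: JSON.stringify({
      slice: { id: sliceId, max: maxSlices },
      size: 2000,
      query: { match_all: {} },
    }),
  }).then((r) => r.json());

  // Page through the slice until no more hits are returned.
  while (res.hits.hits.length > 0) {
    await onHits(res.hits.hits);
    res = await fetch(`${ES}/_search/scroll`, {
      method: "POST",
      headers: { "Content-Type": "application/json", Authorization: AUTH },
      body: JSON.stringify({ scroll: "1m", scroll_id: res._scroll_id }),
    }).then((r) => r.json());
  }
}
```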
A Data Tap is a single AWS Lambda function with a Function URL and a customized C++ runtime embedding DuckDB. It uses a streaming SQL clause to upload the newline JSON data that was HTTP POSTed and buffered in the Lambda to S3 as Hive-partitioned, ZSTD-compressed Parquet. You can tune the SQL clause yourself for filtering, search, and aggregations, and you can set the thresholds that trigger the upload to S3. A Data Tap already runs very efficiently on the smallest arm64 AWS Lambda, making it the simplest, fastest, and most cost-efficient solution for streaming data onto S3 at scale. You can run it in your own AWS account or have it hosted by Boiling Cloud.
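Conceptually, each worker then just HTTP POSTs its batch as newline-delimited JSON to the Data Tap Function URL. The sketch below assumes a bearer-style Authorization header carrying a BoilingData-issued token and an NDJSON content type; check the Data Tap documentation for the exact header names your tap expects:

```javascript
// Illustrative sketch of pushing a batch of documents to a Data Tap Function URL.
// The Authorization header format and token retrieval are assumptions here.
async function postToTap(tapUrl, token, docs) {
  // Data Taps ingest newline-delimited JSON in the POST body.
  const ndjson = docs.map((d) => JSON.stringify(d)).join("\n") + "\n";
  const res = await fetch(tapUrl, {
    method: "POST",
    headers: { "Content-Type": "application/x-ndjson", Authorization: token },
    body: ndjson,
  });
  if (!res.ok) throw new Error(`Data Tap POST failed with HTTP ${res.status}`);
}

// Example: forward the _source of each scroll hit to the tap.
// await postToTap(process.env.BD_TAPURL, token, hits.map((h) => h._source));
```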
You need a BoilingData account, which you use to create a Data Tap. The account is used to fetch authorization tokens that allow you to send data to a Data Tap (security access control). If you like, you can also share write access with other BoilingData users (see the AUTHORIZED_USERS AWS Lambda environment variable), efficiently creating Data Mesh architectures.
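As a sketch of what sharing could look like with the AWS SDK (the function name and the comma-separated value format of AUTHORIZED_USERS are assumptions here, not confirmed from the Data Tap source):

```javascript
// Illustrative sketch: grant other BoilingData users write access to your Data Tap
// by listing them in the Tap Lambda's AUTHORIZED_USERS environment variable.
// The function name and comma-separated value format are assumptions.
const {
  LambdaClient,
  GetFunctionConfigurationCommand,
  UpdateFunctionConfigurationCommand,
} = require("@aws-sdk/client-lambda");

async function shareTap(functionName, usernames) {
  const lambda = new LambdaClient({});
  // Read the current environment so existing variables are preserved on update.
  const current = await lambda.send(
    new GetFunctionConfigurationCommand({ FunctionName: functionName })
  );
  const variables = { ...(current.Environment?.Variables ?? {}) };
  variables.AUTHORIZED_USERS = usernames.join(",");
  await lambda.send(
    new UpdateFunctionConfigurationCommand({
      FunctionName: functionName,
      Environment: { Variables: variables },
    })
  );
}
```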
1. (optional) Start a local OpenSearch cluster and add test data
Each run adds 1M small documents via the Bulk API. You can run it multiple times to generate more data. The data is dummy: the same single entry repeated. A simplified sketch of the bulk payload is shown after the commands below.
ES_PASSWORD='Admin123__kjljklkjl---' yarn up
time \
INDEX=books \
ES_USERNAME=admin \
ES_PASSWORD='Admin123__kjljklkjl---' \
ES_HOST=localhost \
ES_PORT=9200 \
node src/addDocs.local.js
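The bulk payload the loader builds looks roughly like this (a simplified sketch with a hypothetical dummy document, not the actual src/addDocs.local.js):

```javascript
// Simplified Bulk API sketch (illustrative only, not the actual src/addDocs.local.js).
// Every document is the same dummy entry; one _bulk request indexes a whole batch.
const ES = "http://localhost:9200"; // adjust protocol/TLS and auth to your cluster
const AUTH = "Basic " + Buffer.from("admin:Admin123__kjljklkjl---").toString("base64");
const doc = { title: "dummy title", value: 42 }; // hypothetical placeholder entry

async function bulkIndex(index, count) {
  // The Bulk API takes NDJSON: an action line followed by the document line.
  const body =
    Array.from({ length: count }, () =>
      JSON.stringify({ index: { _index: index } }) + "\n" + JSON.stringify(doc)
    ).join("\n") + "\n";
  const res = await fetch(`${ES}/_bulk`, {
    method: "POST",
    headers: { "Content-Type": "application/x-ndjson", Authorization: AUTH },
    body,
  });
  const json = await res.json();
  if (json.errors) throw new Error("Some bulk items were rejected");
}
```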
2. Dump the index to S3 through your Data Tap
You need your Data Tap URL in the BD_TAPURL environment variable, as well as your BoilingData account credentials in BD_USERNAME and BD_PASSWORD.
time \
INDEX=books \
ES_USERNAME=admin \
ES_PASSWORD='Admin123__kjljklkjl---' \
ES_HOST=localhost \
ES_PORT=9200 \
WORKERS=10 \
BATCH_SIZE=2000 \
BD_TAPURL=addYourTapUrl \
BD_USERNAME=addYourBdUsername \
BD_PASSWORD=addYourBdPassword \
node src/index.js
Clipped output showing network-capped results (i.e. saturating the full capacity of my home broadband uplink in this example), where throughput (MB/s) counts only the actual upload payload data volume.
{ totalCount: 2186008 }
{ id: 1, p: '0%', bytes: 318918 }
{ id: 7, p: '0%', bytes: 318913 }
{ id: 1, p: '1.84%', bytes: 318918 }
{ id: 7, p: '1.84%', bytes: 318913 }
{ id: 9, p: '0%', bytes: 318876 }
{ id: 2, p: '0%', bytes: 318841 }
{ id: 9, p: '1.83%', bytes: 318876 }
{ id: 8, p: '0%', bytes: 318902 }
{ id: 2, p: '1.83%', bytes: 318841 }
{ id: 8, p: '1.82%', bytes: 318902 }
{ id: 3, p: '0%', bytes: 318836 }
{ id: 5, p: '0%', bytes: 318886 }
{ id: 3, p: '1.83%', bytes: 318836 }
{ id: 5, p: '1.83%', bytes: 318886 }
{ id: 6, p: '0%', bytes: 318826 }
{ id: 10, p: '0%', bytes: 318970 }
...
{ id: 4, p: '100.00%', bytes: 37753617 }
{ id: 9, p: '99.98%', bytes: 37401533 }
{ id: 4, p: '100.00%', bytes: 37930929 }
{ id: 9, p: '100.00%', bytes: 37750612 }
{ id: 9, p: '100.00%', bytes: 37757436 }
{
sentCount: 2186008,
totalCount: 2186008,
sentMBytes: '361.05',
throughput: '9.11'
}
✨ Done in 40.07s.
*) ES replication (2-3x), EBS volume utilisation (50-75%), EBS volume cost (3.8x more than S3), and the heavy indexing of the data all affect the cost-efficiency factor. However, the biggest difference comes from the fact that S3's very high durability removes the need for replication, while the compressed columnar Parquet format shrinks the data very efficiently - especially with sorted data, which pushes the factor even further. As a rough illustration: 3x replication, times 2x for 50% volume utilisation, times the 3.8x EBS-vs-S3 price already gives about 23x, and a further 3-4x reduction from ZSTD-compressed Parquet (a data-dependent compression ratio assumed here) lands in the 70-100x range.