Merge pull request #180 from broadinstitute/FE-287-implement-copy-fro…
…m-tdr-to-gcs

FE-287 implement copy_from_tdr_to_gcs_hca
bahill authored Aug 14, 2024
2 parents 7131996 + d647b61 commit 6d34efe
Showing 14 changed files with 442 additions and 93 deletions.
@@ -0,0 +1,40 @@
name: Build and Publish Dev Images for scripts/tdr/copy_from_tdr_to_gcs_hca
on:
  push:
    branches-ignore: [main]
    paths:
      - scripts/tdr/copy_from_tdr_to_gcs_hca/**
      - .github/workflows/build_and_push_docker_copy_from_tdr_to_gcs_hca_dev.yaml
env:
  GCP_PROJECT_ID: dsp-fieldeng-dev
  GCP_REPOSITORY: horsefish
  GITHUB_SHA: ${{ github.sha }}

jobs:
  build-and-push-dev-images:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1

      - name: Login to GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.BASE64_SAKEY_DSPFIELDENG_GARPUSHER }}

      - name: Configure Docker to use the Google Artifact Registry
        run: gcloud auth configure-docker us-east4-docker.pkg.dev

      - name: Build and Push copy_from_tdr_to_gcs_hca Docker Image
        run: |
          docker build -t us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA -f scripts/tdr/copy_from_tdr_to_gcs_hca/Dockerfile scripts/tdr/copy_from_tdr_to_gcs_hca
          docker push us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA
      - name: Set image tag to 'dev'
        run: |
          docker tag us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:dev
          docker push us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:dev
@@ -0,0 +1,43 @@
name: Build and Publish Latest Images for scripts/tdr/copy_from_tdr_to_gcs_hca
on:
  pull_request_target:
    types:
      - closed
    branches:
      - main
    paths:
      - scripts/tdr/copy_from_tdr_to_gcs_hca/**
      - .github/workflows/build_and_push_docker_copy_from_tdr_to_gcs_hca_main.yaml
env:
  GCP_PROJECT_ID: dsp-fieldeng-dev
  GCP_REPOSITORY: horsefish
  GITHUB_SHA: ${{ github.sha }}

jobs:
  build-and-push-dev-images:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1

      - name: Login to GCP
        uses: google-github-actions/auth@v1
        with:
          credentials_json: ${{ secrets.BASE64_SAKEY_DSPFIELDENG_GARPUSHER }}

      - name: Configure Docker to use the Google Artifact Registry
        run: gcloud auth configure-docker us-east4-docker.pkg.dev

      - name: Build and Push copy_from_tdr_to_gcs_hca Docker Image
        run: |
          docker build -t us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA -f scripts/tdr/copy_from_tdr_to_gcs_hca/Dockerfile scripts/tdr/copy_from_tdr_to_gcs_hca
          docker push us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA
      - name: Set image tag to 'latest'
        run: |
          docker tag us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:latest
          docker push us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:latest
2 changes: 1 addition & 1 deletion .github/workflows/build_and_push_docker_gen_dev.yaml
@@ -4,7 +4,7 @@ on:
     branches-ignore: [main]
     paths:
       - scripts/general/**
-      - .github/workflows/**
+      - .github/workflows/build_and_push_docker_gen_dev.yaml
 env:
   GCP_PROJECT_ID: dsp-fieldeng-dev
   GCP_REPOSITORY_GENERAL: horsefish
2 changes: 1 addition & 1 deletion .github/workflows/build_and_push_docker_gen_main.yaml
@@ -7,7 +7,7 @@ on:
       - main
     paths:
       - scripts/general/**
-      - .github/workflows/**
+      - .github/workflows/build_and_push_docker_gen_main.yaml
 env:
   GCP_PROJECT_ID: dsp-fieldeng-dev
   GCP_REPOSITORY_GENERAL: horsefish
22 changes: 22 additions & 0 deletions scripts/general/test.py
@@ -0,0 +1,22 @@
# Usage: python test.py file1 file2
# file1: file to loop through
# file2: file to search for matching string
# if no matching string found, print the string
# if found, do nothing
# output: print the string if no matching string found, print "found" if all strings are found

import sys

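# Lines to look for, stripped of surrounding whitespace
with open(sys.argv[1]) as f:
    lines = [line.strip() for line in f]

# Full text of the file to search in
with open(sys.argv[2], 'r') as f2:
    data = f2.read()

# Report every line that does not occur anywhere in file2
counter = 0
for line in lines:
    if line not in data:
        print(f'Not Found: {line}')
        counter += 1

# Report success only when every line was found
if counter == 0:
    print("found")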
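
The same check can be sketched as a small function, which makes it easier to reuse and to test. This is a minimal sketch assuming the substring semantics of the script above; the name `find_missing` is hypothetical and does not exist in the repository.

```python
def find_missing(lines, data):
    """Return the stripped lines that do not occur as substrings of data."""
    missing = []
    for line in lines:
        needle = line.strip()
        # An empty string is a substring of any text, so blank lines are
        # skipped explicitly rather than silently counted as "found".
        if needle and needle not in data:
            missing.append(needle)
    return missing


# Usage with made-up sample values
sample_lines = ["alpha\n", "beta\n", "gamma\n"]
sample_data = "alpha and gamma are here"
for entry in find_missing(sample_lines, sample_data):
    print(f"Not Found: {entry}")
```

Returning the list instead of printing inside the loop leaves the caller free to decide how to report the result.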
17 changes: 0 additions & 17 deletions scripts/tdr/copy_from_tdr_to_gcs/README.md

This file was deleted.

69 changes: 0 additions & 69 deletions scripts/tdr/copy_from_tdr_to_gcs/from_bash_copy_from_tdr.py

This file was deleted.

34 changes: 34 additions & 0 deletions scripts/tdr/copy_from_tdr_to_gcs_hca/Dockerfile
@@ -0,0 +1,34 @@
FROM us.gcr.io/broad-dsp-gcr-public/base/python:3.12-alpine

ENV PATH /google-cloud-sdk/bin:$PATH
RUN if [ `uname -m` = 'x86_64' ]; then echo -n "x86_64" > /tmp/arch; else echo -n "arm" > /tmp/arch; fi;
RUN ARCH=`cat /tmp/arch` && apk --no-cache upgrade && apk --no-cache add \
    bash \
    curl \
    python3 \
    py3-crcmod \
    py3-openssl \
    bash \
    libc6-compat \
    openssh-client \
    git \
    gnupg \
    && curl -O https://dl.google.com/dl/cloudsdk/channels/rapid/downloads/google-cloud-cli-linux-x86_64.tar.gz && \
    tar xzf google-cloud-cli-linux-x86_64.tar.gz && \
    rm google-cloud-cli-linux-x86_64.tar.gz && \
    gcloud config set core/disable_usage_reporting true && \
    gcloud config set component_manager/disable_update_check true && \
    gcloud config set metrics/environment docker_image_alpine && \
    gcloud --version
RUN git config --system credential.'https://source.developers.google.com'.helper gcloud.sh
VOLUME ["/root/.config"]

WORKDIR /scripts/tdr/copy_from_tdr_to_gcs_hca

# copy the contents of /scripts/tdr/copy_from_tdr_to_gcs_hca to the WORKDIR
COPY * .

RUN pip install -r requirements.txt

ENV PYTHONPATH "/scripts:${PYTHONPATH}"
CMD ["/bin/bash"]
71 changes: 71 additions & 0 deletions scripts/tdr/copy_from_tdr_to_gcs_hca/README.md
@@ -0,0 +1,71 @@
# Copy from TDR to GCS
This was originally a bash script written by Samantha Velasquez,\
[get_snapshot_files_and_transfer.sh](get_snapshot_files_and_transfer.sh), \
which was written to copy files from a TDR snapshot to an Azure bucket. \
Bobbie then translated it to Python using Copilot, and it ballooned from there: \
[copy_from_tdr_to_gcs.py](copy_from_tdr_to_gcs.py). \
The bash script is kept here only for posterity, as it previously lived only in Slack.
It has not been tested in the Docker image created for the Python script.
## Running the Script
**IMPORTANT**\
You will need to be in either the [Monster Group](https://groups.google.com/a/broadinstitute.org/g/monster)
or the [Field Eng group](https://groups.google.com/a/broadinstitute.org/g/dsp-fieldeng) to run this script.

You will want to clone the whole horsefish repo, if you have not done so already.

You will also need a manifest file to run the script.\
The format of this manifest is identical to the one used for [HCA ingest](https://docs.google.com/document/d/1NQCDlvLgmkkveD4twX5KGv6SZUl8yaIBgz_l1EcrHdA/edit#heading=h.cg8d8o5kklql).
A sample manifest is provided in the project directory - dcpTEST_manifest.csv.\
(Note that this is a test manifest and you will have to first load the data into TDR to use it - see the HCA ingest Ops manual linked above).\
It's probably easiest to copy out the rows from the original ingest manifest into a new manifest,
then move that file into this project directory, so that it is picked up by compose.

If you are not already logged in to gcloud/docker, you will need to do so before running the Docker compose command.\
`gcloud auth application-default login` \
`gcloud auth configure-docker us-east4-docker.pkg.dev`

To start up the run/dev Docker compose env \
`docker compose run app bash`\
This will pull the latest image from Artifact Registry, start up the container, and mount the project dir,
so changes in your local project dir will be reflected in the container.

Next you will want to authenticate with gcloud using your Broad credentials.\
`gcloud auth login`\
`gcloud config set project dsp-fieldeng-dev`* \
`gcloud auth application-default login`

Then run the script using the following command syntax:\
`python3 copy_from_tdr_to_gcs_hca.py <manifest_file>`

Contact Field Eng for any issues that arise. \
_*If you are not in dsp-fieldeng-dev, use the monster hca prod project instead - mystical-slate-284720_

## Building the Docker Image
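The image is built by two GitHub Actions workflows: \
`.github/workflows/build_and_push_docker_copy_from_tdr_to_gcs_hca_main.yaml` runs when a pull request into main is closed and tags the image `latest`; \
`.github/workflows/build_and_push_docker_copy_from_tdr_to_gcs_hca_dev.yaml` runs on pushes to any branch other than main and tags the image `dev`. \
Both workflows also tag the image with the commit SHA. \
tags = us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:$GITHUB_SHA,
us-east4-docker.pkg.dev/$GCP_PROJECT_ID/$GCP_REPOSITORY/copy_from_tdr_to_gcs_hca:latest (or `:dev` for the dev workflow)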
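
As an illustration of how the image references above are composed, here is a minimal Python sketch. The registry host, project, repository, and image name are taken from the workflow files; the helper name `image_refs` and the sample SHA are hypothetical and only for demonstration.

```python
REGISTRY_HOST = "us-east4-docker.pkg.dev"
IMAGE_NAME = "copy_from_tdr_to_gcs_hca"


def image_refs(project_id, repository, sha, moving_tag):
    """Return the two references a workflow pushes: one per commit, one moving tag."""
    base = f"{REGISTRY_HOST}/{project_id}/{repository}/{IMAGE_NAME}"
    return [f"{base}:{sha}", f"{base}:{moving_tag}"]


# "abc123" is a made-up placeholder for the value of $GITHUB_SHA
for ref in image_refs("dsp-fieldeng-dev", "horsefish", "abc123", "latest"):
    print(ref)
```

The SHA tag pins one specific build, while `latest` and `dev` move each time the corresponding workflow runs.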

### To manually build and run locally
`docker build -t us-east4-docker.pkg.dev/dsp-fieldeng-dev/horsefish/copy_from_tdr_to_gcs_hca:<new_version> .` \
`docker run --rm -it us-east4-docker.pkg.dev/dsp-fieldeng-dev/horsefish/copy_from_tdr_to_gcs_hca:<new_version>`

### To build and push to Artifact Registry
- make sure you are logged in to gcloud and that application default credentials are set \
`gcloud auth login` \
`gcloud config set project dsp-fieldeng-dev` \
`gcloud auth application-default login`
- set the <new_version> before building and pushing \
`docker push us-east4-docker.pkg.dev/dsp-fieldeng-dev/horsefish/copy_from_tdr_to_gcs_hca:<new_version>`


## Possible improvements*
- update the script with conditional logic to accept a snapshot ID and destination instead
- update the script to check the lower-cased institution against lower-cased institution keys - see ~line 86
- update the script to merge `validate_input()` and `_parse_csv()` into one function
- consider adding a copy manifest to this command, so that instead of validating the number of files copied (line 187), you can specifically highlight the files that were not copied successfully.
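
Two of the ideas above, the case-insensitive institution check and the report of files that were not copied, could look roughly like the sketch below. It does not reflect the actual script: the function names, the institution keys, and the file paths are all hypothetical placeholders.

```python
def match_institution(name, known_institutions):
    """Return the canonical key matching name case-insensitively, or None."""
    lookup = {key.lower(): key for key in known_institutions}
    return lookup.get(name.strip().lower())


def files_not_copied(expected, copied):
    """Return the expected paths missing from copied, in their original order."""
    copied_set = set(copied)
    return [path for path in expected if path not in copied_set]


# Placeholder values, for illustration only
known = ["InstituteA", "InstituteB"]
print(match_institution("institutea", known))

expected_files = ["data/a.fastq", "data/b.fastq", "data/c.fastq"]
copied_files = ["data/a.fastq", "data/c.fastq"]
print(files_not_copied(expected_files, copied_files))
```

Comparing the lists of paths rather than only their lengths names the specific files to retry.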

*this is likely to be used only rarely and mostly by the author, as a stop gap until partial updates have been implemented.
As such, we are attempting to keep this as light as possible, so as not to introduce unnecessary complexity.
