The Terra Data Repository is built by the Jade team as part of the Data Biosphere.
- Support complex, user-specified schemas and different types of data
- Fine-grained access control
- Support share-in-place, because copying data is expensive
- Cloud-transparency: support off-the-shelf tool access to data
More information can be found in our terra support documentation.
This repository is currently designed to be deployed inside of a Google Cloud Platform project to manage tabular and file data backed either by GCP or Azure. The project setup has been automated via Terraform.
Follow our getting started guide to get set up.
If you are making code changes, run:
./gradlew check
To verify that TDR adheres to the contracts published by its consumers, run:
./src/test/render-pact-configs.sh
# Reload your environment variables, e.g. source ~/.zshrc
./gradlew verifyPacts # verify contracts published with TDR as the provider
By default, this will fetch published contracts from the live Pact broker. Results of Pact verification are only published when running in a CI environment (not locally).
Before you run for the first time, you need to generate the credentials file by running ./render-configs.sh
To run TDR locally:
./gradlew bootRun
To run TDR locally and wait for a debugger to attach on port 5005:
./gradlew bootRun --debug-jvm
To have the code hot reload, enable automatic builds in IntelliJ. Go to:
Preferences -> Build, Execution, Deployment -> Compiler
and select Build project automatically
Note: when running locally, it may be useful to log traditional log messages instead of JSON. This can be enabled by
setting the environment variable:
TDR_LOG_APPENDER=Console-Standard
(the default is "Console-Stackdriver")
The swagger page is: https://local.broadinstitute.org:8080
./gradlew testConnected
The integration tests will hit the data repo running in the broad-jade-integration environment by default. To use a
different data-repo, edit the src/main/resources/application-integration.properties file and specify the URL. Before
you run the integration tests, you need to generate the correct pem file by running ./render-configs.sh
To run the tests, use: ./gradlew testIntegration
SourceClear is a static analysis tool that scans a project's Java dependencies for known vulnerabilities. If you are working on addressing dependency vulnerabilities in response to a SourceClear finding, you may want to run a scan off of a feature branch and/or local code.
You can trigger TDR's SCA scan on demand via its GitHub Action, and optionally specify a GitHub ref (branch, tag, or SHA) to check out from the repo to scan. By default, the scan is run off of TDR's develop branch.
High-level results are outputted in the GitHub Actions run.
You will need to get the API token from Vault before running the Gradle srcclr task.
export SRCCLR_API_TOKEN=$(vault read -field=api_token secret/secops/ci/srcclr/gradle-agent)
./gradlew srcclr
High-level results are outputted to the terminal.
Full results including dependency graphs are uploaded to Veracode (if running off of a feature branch, navigate to Project Details > Selected Branch > Change to select your feature branch). You can request a Veracode account to view full results from #dsp-infosec-champions.
We are using swagger-codegen to generate code from the swagger (OpenAPI) document. Therefore, in order to build you need to have the codegen tool installed from swagger-codegen.
The Gradle compile uses swagger-codegen to generate the model and controller interface code into src/generated/java/bio/terra/models and src/generated/java/bio/terra/controllers respectively. Code in src/generated is not committed to GitHub. It is generated as needed.
Adding an endpoint to the API source (data-repository-openapi.yaml) will generate the endpoint definition in the
appropriate controller interface file. Swagger-codegen provides a default implementation of the endpoint that generates
a NOT_IMPLEMENTED return. You add the actual implementation of the new interface by editing the Jade controller code
in src/main/java/bio/terra/controller. That overrides the default interface implementation.
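The override mechanism described above can be illustrated with a small, framework-free sketch. It is not the generated code itself: `WidgetsApi`, `WidgetsApiController`, `Response`, and the endpoint method names are hypothetical stand-ins, and the real generated interfaces are Spring-based. The sketch only shows how an interface default method yields a 501-style answer until a controller class overrides it.

```java
// Hypothetical, framework-free stand-ins for the generated and hand-written code.
class ControllerOverrideSketch {

  /** Minimal stand-in for an HTTP response: a status code and a body. */
  record Response(int status, String body) {}

  /** Plays the role of a generated controller interface under src/generated. */
  interface WidgetsApi {
    // Generated-style default: the endpoint answers 501 until it is overridden.
    default Response retrieveWidget(String id) {
      return new Response(501, "NOT_IMPLEMENTED");
    }

    default Response deleteWidget(String id) {
      return new Response(501, "NOT_IMPLEMENTED");
    }
  }

  /** Plays the role of a hand-written controller under src/main/java/bio/terra/controller. */
  static class WidgetsApiController implements WidgetsApi {
    @Override
    public Response retrieveWidget(String id) {
      // The real implementation would call into a service layer here.
      return new Response(200, "widget " + id);
    }
    // deleteWidget is not overridden, so it keeps the default 501 behavior.
  }

  public static void main(String[] args) {
    WidgetsApi api = new WidgetsApiController();
    System.out.println(api.retrieveWidget("abc").status()); // overridden endpoint
    System.out.println(api.deleteWidget("abc").status());   // still the default
  }
}
```

Because the defaults live on the interface, adding a new endpoint to the API document keeps the build compiling even before the controller implements it.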
If you make breaking changes to the API, you will have to do the appropriate refactoring in the rest of the code base. For simple additions, such as new fields in a structure or new endpoints, the build will continue to run clean.
In the rare case of wanting to have swagger-codegen create a controller class, in a directory other than a git-cloned workspace, run:
swagger-codegen generate -i path/to/data-repository-openapi.yaml -l spring -c path/to/config.json
Then copy the files you want into the source tree.
To render your own local skaffold.yaml, run the following with your initials:
sed -e 's/TEMP/<initials>/g' skaffold.yaml.template > skaffold.yaml
To run a deployment, you must first set the env var IMAGE_TAG, then run:
skaffold run
- Locally, application properties are controlled by the values in the various application.properties files.
application.properties contains the base/default values. A new property should be added here first.
google.allowReuseExistingBuckets=false
- You can override the default value for connected and integration tests by adding a line to application-connectedtest.properties and application-integrationtest.properties.
google.allowReuseExistingBuckets=true
- Now that we use Helm, the properties also need to be added to the base Data Repo charts.
- Find the api-deployment.yaml file.
- Add a new property under the env section. The formatting below might be messed up, and the YAML is very picky about spaces, so copy/paste from another variable in the section instead of from here.
{{- if .Values.env.googleAllowreuseexistingbuckets }}
- name: GOOGLE_ALLOWREUSEEXISTINGBUCKETS
  value: {{ .Values.env.googleAllowreuseexistingbuckets | quote }}
{{- end }}
- Find the values.yaml file.
- Add a new line under the env section.
googleAllowreuseexistingbuckets:
- Release a new version of the chart. Talk to DevOps to do this.
- To override properties for specific environments (e.g. integration), modify the environment-specific override Data Repo charts.
- Find the deployment.yaml for the specific environment.
- Add a new line under the env section.
googleAllowreuseexistingbuckets: true
- It's a good idea to test out changes on your developer-namespace before making a PR.
- Changes to integration, temp, or developer-namespace environments are good with regular PR approval (1 thumb for this repository).
- Changes to dev or prod need more eyes, and perhaps a group discussion to discuss possible side effects or failure modes.
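The environment variable name used in the Helm steps above (GOOGLE_ALLOWREUSEEXISTINGBUCKETS for google.allowReuseExistingBuckets) matches Spring Boot's documented rule for binding environment variables to properties: replace dots with underscores, remove dashes, and convert to uppercase. The following is a minimal sketch of that rule, assuming TDR relies on the standard Spring Boot binding. The class and method names are hypothetical helpers for checking a name by hand, not part of TDR.

```java
import java.util.Locale;

// Hypothetical helper, not part of TDR: derives the environment variable name
// that Spring Boot's relaxed binding would map to a given property name.
class PropertyEnvNameSketch {

  static String toEnvVarName(String propertyName) {
    return propertyName
        .replace('.', '_')          // dots become underscores
        .replace("-", "")           // dashes are removed
        .toUpperCase(Locale.ROOT);  // everything is upper-cased
  }

  public static void main(String[] args) {
    // Prints GOOGLE_ALLOWREUSEEXISTINGBUCKETS
    System.out.println(toEnvVarName("google.allowReuseExistingBuckets"));
  }
}
```

A quick check like this can help catch a misspelled variable name before a chart release, since a wrong name would simply leave the default value in effect.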
Care must be taken with the handling of InterruptedException. Code running in stairway flights must expect to receive InterruptedException during waiting states when the pod is being shut down. It is important that the exception be allowed to propagate up to the stairway layer, so that proper termination and re-queuing of the flight can be performed.
On the other hand, code running outside of the stairway flight, where the exception can become the response to a REST API, should catch the InterruptedException and replace it with a meaningful exception. Otherwise the caller gets a useless error message.
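The two cases can be contrasted in a short sketch. It does not use the Stairway API: `BlockingCall`, `runInsideFlight`, `runOutsideFlight`, and `ServiceUnavailableException` are hypothetical names chosen for illustration. Restoring the interrupt flag before rethrowing is general Java practice and an assumption here, not something this document prescribes.

```java
// Hypothetical sketch contrasting the two handling styles; no Stairway types are used.
class InterruptHandlingSketch {

  /** Hypothetical unchecked exception intended to surface as a REST error response. */
  static class ServiceUnavailableException extends RuntimeException {
    ServiceUnavailableException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Stand-in for any blocking operation that can be interrupted while waiting. */
  interface BlockingCall {
    String await() throws InterruptedException;
  }

  /** Flight-style code: declare InterruptedException and let it propagate upward. */
  static String runInsideFlight(BlockingCall call) throws InterruptedException {
    // No catch block: the layer above decides how to terminate and re-queue the work.
    return call.await();
  }

  /** REST-facing code: translate the interrupt into an exception with a useful message. */
  static String runOutsideFlight(BlockingCall call) {
    try {
      return call.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // keep the interrupt visible to callers
      throw new ServiceUnavailableException(
          "The request was interrupted before it completed; please retry.", e);
    }
  }

  public static void main(String[] args) {
    BlockingCall interrupted = () -> {
      throw new InterruptedException("pod shutting down");
    };
    try {
      runInsideFlight(interrupted);
    } catch (InterruptedException e) {
      System.out.println("flight layer saw: " + e.getMessage());
    }
    try {
      runOutsideFlight(interrupted);
    } catch (ServiceUnavailableException e) {
      System.out.println("API caller saw: " + e.getMessage());
      Thread.interrupted(); // clear the flag set by the sketch before exiting
    }
  }
}
```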
The deployments of Terra Data Repository are:
Sonar is a static analysis tool that scans code for a wide range of issues, including maintainability and possible bugs. If you get a build failure due to SonarQube and want to debug the problem locally, you need to get the Sonar token from Vault before running the Gradle task.
export SONAR_TOKEN=$(vault read -field=sonar_token secret/secops/ci/sonarcloud/data-repo)
./gradlew sonar
Running this task produces no output unless your project has errors. To always generate a report, run using --info:
./gradlew sonar --info