
Add S3 tile caching to Terraform config #686

Merged 3 commits into develop on Jan 31, 2019

Conversation

jeancochrane (Contributor)

Overview

Support Tilegarden caching by creating an S3 bucket for cached tiles, configuring CloudFront to try the cache first, and updating the app containers to enable caching.

Notes

  • I had to make a few small adjustments to Tilegarden to get this up and running. Because of that, I've requested @KlaasH's review on this as well.

Testing Instructions

  • In the VM, run a plan/apply cycle and confirm that only the task definitions get updated (to match the new batch jobs that the plan creates):
vagrant@vagrant-ubuntu-trusty-64:~$ export GIT_COMMIT=a7daa2a
vagrant@vagrant-ubuntu-trusty-64:~$ export ENVIRONMENT=staging
vagrant@vagrant-ubuntu-trusty-64:~$ export PFB_AWS_ECR_ENDPOINT="950872791630.dkr.ecr.us-east-1.amazonaws.com"
vagrant@vagrant-ubuntu-trusty-64:~$ export PFB_SETTINGS_BUCKET=staging-pfb-config-us-east-1
vagrant@vagrant-ubuntu-trusty-64:~$ ./scripts/infra plan
...
vagrant@vagrant-ubuntu-trusty-64:~$ ./scripts/infra apply
  • Visit the staging site at https://staging.pfb.azavea.com, select All Places, and choose a place to load up some tiles
  • If the view has been visited before: you should see rapid tile display, and the network tab should show that the tiles are coming from tiles.staging.pfb.azavea.com/
  • If the view has not been visited before: Lambda should take a few seconds to generate the tiles, and the network tab should show that the tiles are coming from tiles.staging.pfb.azavea.com/latest (the non-cached endpoint)
    • Hard refreshing the page may not confirm that the tile has been written to the cache, since (I believe) CloudFront will cache the 302 redirect to /latest for some short amount of time. To test that the cache is getting written, you can do either of two things:
      1. Check the cache S3 bucket staging-pfb-tile-cache-us-east-1 to confirm that the path to the tile you've requested exists in the cache
      2. Create a CloudFront invalidation for /tiles/* to clear the CloudFront cache, and then check your endpoint again to confirm that the tile got cached

Closes #634.

Jean Cochrane added 2 commits January 29, 2019 12:40
Support Tilegarden caching by creating an S3 bucket for cached tiles,
configuring CloudFront to try the cache first, and updating the
app containers to enable caching.
Fix two issues with the tile cache endpoint:

* Make sure the CDN has the right endpoint
* Make sure that Tilegarden configures environment variables
@@ -8,7 +8,6 @@ resource "aws_cloudfront_distribution" "tilegarden" {
}

domain_name = "${var.tilegarden_api_gateway_domain_name}"
-   origin_path = "/latest"
Contributor Author

Now that we're routing /latest requests directly to the cache, we no longer need to set up this path in the origin config.

custom_origin_config {
http_port = 80
https_port = 443
# S3 websites don't support TLS :/
Contributor Author

Is there an established way around this? It seems icky to send the second leg of the request over plaintext.

Contributor

Not if we're using the S3 website endpoint. I haven't looked at the other parts of the request path, but have we considered hitting the dynamic endpoint first?

Contributor Author

I had assumed that we would need to use the S3 website endpoint in order to use redirect rules, is there a way of accomplishing that with the dynamic endpoint?

Contributor

Since we're just hitting static content, I think going over HTTP for this leg is OK.

@@ -24,8 +24,8 @@
],
"scripts": {
"build-all-xml": "./scripts/build-all-xml.sh src/config src/config",
-    "deploy": "yarn compile && claudia update --no-optional-dependencies ${LAMBDA_TIMEOUT:+--timeout ${LAMBDA_TIMEOUT}} ${LAMBDA_MEMORY:+--memory ${LAMBDA_MEMORY}} ${LAMBDA_SECURITY_GROUPS:+--security-group-ids ${LAMBDA_SECURITY_GROUPS}} ${LAMBDA_SUBNETS:+--subnet-ids ${LAMBDA_SUBNETS}}",
"deploy-new": "yarn compile && claudia create --no-optional-dependencies --api-module dist/api --name ${PROJECT_NAME} --region ${LAMBDA_REGION} ${LAMBDA_ROLE:+--role ${LAMBDA_ROLE}} ${LAMBDA_TIMEOUT:+--timeout ${LAMBDA_TIMEOUT}} ${LAMBDA_MEMORY:+--memory ${LAMBDA_MEMORY}} ${LAMBDA_SECURITY_GROUPS:+--security-group-ids ${LAMBDA_SECURITY_GROUPS}} ${LAMBDA_SUBNETS:+--subnet-ids ${LAMBDA_SUBNETS}} && yarn parse-id",
+    "deploy": "yarn compile && claudia update --no-optional-dependencies ${LAMBDA_TIMEOUT:+--timeout ${LAMBDA_TIMEOUT}} ${LAMBDA_MEMORY:+--memory ${LAMBDA_MEMORY}} ${LAMBDA_SECURITY_GROUPS:+--security-group-ids ${LAMBDA_SECURITY_GROUPS}} ${LAMBDA_SUBNETS:+--subnet-ids ${LAMBDA_SUBNETS}} ${PFB_TILEGARDEN_CACHE_BUCKET:+--set-env PFB_TILEGARDEN_CACHE_BUCKET=${PFB_TILEGARDEN_CACHE_BUCKET}}",
Contributor Author

Following #673, Tilegarden reads the cache bucket name from process.env.PFB_TILEGARDEN_CACHE_BUCKET. In order to get environment variables loaded into the Lambda execution environment, however, Claudia needs us to configure them via the --set-env flag.
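The flag above only injects the variable into the Lambda environment at deploy time. On the Tilegarden side, reading it might look roughly like this (a minimal sketch; the getCacheConfig name and return shape are mine, not Tilegarden's actual API):

```javascript
// Sketch: read the cache bucket name that Claudia injects via --set-env.
// If the variable is unset (e.g. in local development), caching is
// simply disabled rather than failing.
// getCacheConfig and its return shape are illustrative, not Tilegarden's API.
function getCacheConfig(env) {
    const bucket = env.PFB_TILEGARDEN_CACHE_BUCKET;
    return {
        cacheEnabled: Boolean(bucket),
        bucket: bucket || null,
    };
}
```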

} else if (key.startsWith('/')) {
// API Gateway request.path object starts with a leading slash,
// which would cause the uploaded object to have the wrong prefix.
key = key.slice(1)
Contributor Author

While debugging, I noticed that cached tiles were getting written to the S3 bucket under an empty root directory (showing up as a whitespace character in the directory name). I'm pretty sure this is due to another difference between the API Gateway request object representations in claudia-local-api and AWS.
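The fix amounts to stripping the leading slash before using the request path as an S3 key. As a standalone sketch (normalizeKey is an illustrative name, not from the Tilegarden source):

```javascript
// Sketch of the leading-slash normalization described above.
// API Gateway's request.path starts with "/", so using it directly as an
// S3 key would create an empty top-level "directory" in the bucket.
function normalizeKey(path) {
    return path.startsWith('/') ? path.slice(1) : path;
}
```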

Contributor

Good catch and thanks for fixing it up.


viewer_protocol_policy = "redirect-to-https"
min_ttl = 0
default_ttl = "300" # Five minutes
Contributor

When I was testing on staging, I got some tiles that timed out, and the error message was then cached for 5 minutes even though the Lambda function had actually succeeded and written the tile to S3.

So that points to one issue that should be addressed, I assume on the origin definition: the "Origin Response Timeout" should be no shorter than the timeout on the Lambda function (currently 1 minute).

I also think it makes sense to set the TTL for the "new tiles" endpoint as short as is reasonable, so that if anything goes wrong the retry delay is minimal. There may be no window in which a cached response from the API Gateway origin actually saves a Lambda invocation: before the first invocation finishes there is no cached response to return, and the tile is written to S3 before it is returned from the Lambda function, so there may never be a time when a request to the S3 cache would miss but one to the API Gateway origin would hit.

So 5 minutes seems fine for the S3 origin (longer would probably be fine too, but I don't know that there would be any advantage to it), but it should be overridden for the "new tiles" cache_behavior below, probably to 1 minute or slightly more.

Contributor Author

I'm glad you caught this, very subtle but important behavior here. I'll bump the Origin Read Timeout to 60 seconds, which should hopefully make timeouts less common.

I agree, I don't see any benefit to caching the Lambda origin. Can you think of any downside to just setting min_ttl, default_ttl, and max_ttl to 0 for that origin?

Contributor

Can you think of any downside to just setting min_ttl, default_ttl, and max_ttl to 0 for that origin?

Actually, now I'm thinking I might have missed an interaction in what I wrote above: would the redirect from / to /latest/ be cached using the default TTL from default_cache_behavior? That is, having gotten a redirect on the initial attempt, will it keep redirecting until the TTL has passed? I was thinking the cached error was entirely the fault of the cache rule on the Lambda endpoint, but now I suspect I was getting two cached responses. In that case, a longer TTL on the root/S3 endpoint than on the Lambda endpoint would mean extra invocations until the cached redirect expires.

Contributor Author

I.e. having gotten a redirect on the initial attempt, will it keep redirecting until the timeout has passed?

Good point, I think it will indeed cache the redirect (see my note in the testing instructions above). Sounds like the TTLs should be identical for both origins.

Contributor

Yeah, seems like. The change to origin_read_timeout should take care of the original issue I saw, though, so hopefully there won't be any caching of bad responses anyway.

* Adjust Lambda Tilegarden CDN origin so that its timeout matches
  the lambda function timeout
* Run terraform fmt to clean up some bad formatting
@jeancochrane force-pushed the feature/jfc/s3-tile-cache branch from a9a3cd3 to 766f429 on January 30, 2019 20:59
@KlaasH (Contributor) left a comment

This looks good to me 👍

@hectcastro (Contributor) left a comment

👍

@jeancochrane merged commit bf9a357 into develop on Jan 31, 2019
@jeancochrane deleted the feature/jfc/s3-tile-cache branch on January 31, 2019 17:28

4 participants