Unpredictable function execution/APIGW response times #418

Open · smoll opened this issue Oct 20, 2016 · 12 comments

@smoll (Contributor) commented Oct 20, 2016

This is more of a question/investigation than a bug report per se, but I've noticed this behavior on two different Zappa-based APIs I've started working on.

Basically, I've noticed that when I call an endpoint after not using it for a while, I get a very long response time, e.g.

$ time http https://abcd.execute-api.us-east-1.amazonaws.com/dev/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 19
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Oct 2016 07:18:17 GMT
Set-Cookie: zappa=AQ4; Path=/
Via: 1.1 2e4bb09f82d346fc09032cfb35e3c3e4.cloudfront.net (CloudFront)
X-Amz-Cf-Id: lEhkWktpPID6i-IflVRUPALDgCRF8agsL7CI2jnyvD1CcPVKSjG1OQ==
X-Cache: Miss from cloudfront
x-amzn-Remapped-Content-Length: 19
x-amzn-RequestId: 5f184bf1-9695-11e6-9e74-b70f8d856374

This is the index!


real    0m5.518s
user    0m0.275s
sys     0m0.255s

but if I call it again immediately afterwards, I get a much faster response:

$ time http https://abcd.execute-api.us-east-1.amazonaws.com/dev/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 19
Content-Type: text/html; charset=utf-8
Date: Thu, 20 Oct 2016 07:18:19 GMT
Set-Cookie: zappa=AQ4; Path=/
Via: 1.1 031c38bec1e4f8401157e1d767a53637.cloudfront.net (CloudFront)
X-Amz-Cf-Id: wTr7B0ZTGHTY7GRm-Uw2IY76pcPhc8Bnb7wMLiDhFR9wwd4XFuWd5w==
X-Cache: Miss from cloudfront
x-amzn-Remapped-Content-Length: 19
x-amzn-RequestId: 6077b516-9695-11e6-a92a-657b31009f49

This is the index!


real    0m0.771s
user    0m0.264s
sys     0m0.168s

The "API" implementation is a simple print statement so it's not worth showing, but here are my Zappa settings:

{
    "dev": {
        "project_name": "myapi",
        "app_function": "myapi.app",
        "manage_roles": false,
        "role_name": "ZappaLambdaExecution",
        "s3_bucket": "dev-myapi-bucket",
        "domain": "myapi.com",
        "lets_encrypt_key": "s3://mybucket/account.key",
        "lets_encrypt_expression": "rate(2 days)",
        "keep_warm": true,
        "keep_warm_expression": "rate(1 minute)",
        "timeout_seconds": 69
    }
}

I increased the frequency of the keep_warm_expression key to see if it helps, but it doesn't seem to.

I don't see anything unusual in CloudWatch either; all of the log lines look like this:

REPORT RequestId: dde080d0-9695-11e6-9593-ab204f2952ay  Duration: 0.62 ms   Billed Duration: 100 ms Memory Size: 512 MB Max Memory Used: 21 MB

Any ideas where to look for more info?

smoll closed this as completed Oct 20, 2016
smoll reopened this Oct 20, 2016
@smoll (Contributor, Author) commented Oct 23, 2016

I've been playing around with frameworkless Lambda + API Gateway and I'm noticing the exact same behavior when I have a standalone Lambda that I never pre-warm.

This makes me wonder whether my Zappa-managed keep_warm is doing anything at all. Has anyone here actually verified that keep_warm is correctly keeping their function warm? If so, I'll try to look into what might be different about my configuration. If not, I'll dig more into the Zappa source code around it.

Please chime in if you have any relevant data to share!

Edit: I must have glossed over this before, but I do see lines like this in CloudWatch every minute, so the event is definitely firing. It doesn't seem to have any effect on the function's startup time, though:

[DEBUG] 2016-10-23T09:24:53.743Z    8e3aa106-9902-11e6-918a-c988a846fd49    Zappa Event: {u'account': u'123456789012', u'region': u'us-east-1', u'detail': {}, u'detail-type': u'Scheduled Event', u'source': u'aws.events', u'version': u'0', u'time': u'2016-10-23T09:24:38Z', u'id': u'a85fe08b-add6-4fd4-bcda-d32d14eadc76', u'resources': [u'arn:aws:events:us-east-1:123456789012:rule/myproject-zappa-keep-warm-handler.keep_warm_callback']}
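
For anyone checking the same thing, here is a minimal sketch (not Zappa's actual handler code; names are illustrative) of how a handler can tell the scheduled keep_warm ping apart from a real API Gateway request, using the event fields visible in the log line above, and log whether the container was actually cold:

# Hypothetical sketch, not Zappa's real handler: distinguish the scheduled
# keep_warm event from an API Gateway request and log cold starts.
import logging

logger = logging.getLogger(__name__)
COLD_START = True  # module-level flag; True only for the first invocation of a container


def handler(event, context):
    global COLD_START
    was_cold, COLD_START = COLD_START, False

    if event.get("source") == "aws.events" and event.get("detail-type") == "Scheduled Event":
        # Keep-warm ping from the CloudWatch Events rule: just log and return.
        logger.info("keep_warm ping, cold_start=%s", was_cold)
        return {}

    # Otherwise treat it as an API Gateway request and handle it normally.
    logger.info("api request, cold_start=%s", was_cold)
    return {"statusCode": 200, "body": "This is the index!"}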

@jnlaia (Contributor) commented Nov 10, 2016

I am having similar issues.

Below (the left axis is in milliseconds), a keep_warm with a 3-minute rate seems to have spikes every three hours or so, which is interesting/weird in itself. Also, when I actually make a manual request (around 10:45), I still get a high response time.

[chart: response times in milliseconds, with spikes roughly every three hours]

Does anyone have a clue what is going on?

Thanks

@jnlaia (Contributor) commented Nov 11, 2016

Some more feedback on this. It actually seems that Zappa's keep_warm is not doing anything. However, if one warms up the web app by making an actual request via API Gateway (for example, from another service), it is possible to keep it warm. So it seems API Gateway is playing a role here.
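
For reference, a minimal sketch of that approach, assuming the example endpoint from this issue (the URL and the schedule are whatever your external service uses):

# Warm the function through API Gateway from a scheduled job (cron, another
# service, etc.) so the same code path as real traffic gets exercised.
import requests

WARM_URL = "https://abcd.execute-api.us-east-1.amazonaws.com/dev/"  # example URL from above

def warm():
    resp = requests.get(WARM_URL, timeout=30)
    print(resp.status_code, resp.elapsed.total_seconds())

if __name__ == "__main__":
    warm()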

@wagmiwiz (Contributor) commented:

I am seeing this too.

We have one API that has irregular but bursty traffic: not much happens, then suddenly many requests arrive at once. If you run the burst batches one after the other, the second batch is super fast, but the first batch has many requests taking several seconds. I noticed it's due to several Lambdas spinning up ("Instancing..." in the logs), and that takes 2-5 seconds.

One wild guess is that keep_warm just keeps one Lambda warm and a burst obviously requires concurrency... or it is indeed something to do with a difference in the way Lambda routes the event depending on whether it comes from API Gateway or is scheduled. 🤔
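
If that guess is right, one rough sketch of a workaround would be to fire several overlapping invocations so more than one container stays warm (the function name, payload and concurrency level below are made up for illustration):

# Hypothetical concurrent warm-up: N overlapping RequestResponse invocations
# force N containers to exist at once, which one warm ping per minute never does.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "myapi-dev"  # hypothetical Zappa-deployed function name
CONCURRENCY = 5              # how many containers we hope to keep warm

def ping(_):
    response = lambda_client.invoke(
        FunctionName=FUNCTION_NAME,
        InvocationType="RequestResponse",  # stays open until the invocation finishes
        Payload=json.dumps({"source": "aws.events", "detail-type": "Scheduled Event", "detail": {}}),
    )
    return response["StatusCode"]

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    print(list(pool.map(ping, range(CONCURRENCY))))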

@danielwhatmuff (Contributor) commented:

It might be worth using https://uptimerobot.com/ or https://updown.io to send a request via API Gateway as an additional keep-warm. I do that and haven't seen this behaviour.

@jleclanche commented:

So, today I investigated a long-standing issue we had with one of our Zappa-powered Lambdas. The Lambda is configured with a 30-second timeout and is expected to have an average runtime of 500ms.

[charts: duration and timeout metrics for the Lambda]

This doesn't look right: we're getting a ton of 30-second max durations, and in the graph on the right we can see increased timeouts at peak time.

I had assumed the timeouts were related to some kind of database or external-API contention, but I found that they were all in cold starts. Some of the Lambda machines would need up to 40 seconds to do a cold start; if they hit the timeout, they would get stuck permanently trying to cold start and eventually give up and be rotated out. Increasing the default timeout of 30 seconds to 45 seconds actually fixed the issue. See the red line here:

[chart: timeouts drop after the change; the red line marks the timeout increase]

Now, the real question is: why is a cold start taking so long? I profiled a Lambda invocation that took 22 seconds to complete. Of those 22 seconds, the following two lines of code (in zappa.ext.django_zappa.get_django_wsgi) took 21 seconds to run:

from django.core.wsgi import get_wsgi_application  # 6500ms
...
get_wsgi_application()  # 14500ms

This is a 128MB Lambda (Python 3.6 runtime), so it has the weakest CPU. I expected cold starts to be bad, but this is pretty brutal.
I strace'd those two calls locally and nothing really stands out. Most of the time is spent iterating over all the dependencies (and there really aren't that many: django, django-allauth, django-cors-headers, oauth-toolkit, djangorestframework, psycopg2, pyjwt and raven). Most of that time is spent opening Zappa itself and its dependencies. psycopg2 is pretty bad as well.
But on my home CPU, all this takes 300ms. I could understand it being one order of magnitude slower on the weakest Lambda, but two is really confusing. Unfortunately, running strace on Lambda isn't exactly easy, so I'm not completely sure what to do next to improve the situation.
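
One cheap way to see which imports dominate, without strace, is to time each top-level import in the packaged environment. The import names below are my best guess at what those distributions expose, and results depend on import order, since anything already imported is cached:

# Rough per-dependency import timing; run it inside the deployment package,
# ideally on Lambda itself, since the 128MB CPU share changes the numbers.
import importlib
import time

PACKAGES = ["django", "allauth", "corsheaders", "oauth2_provider",
            "rest_framework", "psycopg2", "jwt", "raven"]

for name in PACKAGES:
    start = time.perf_counter()
    importlib.import_module(name)
    print("%s: %.0f ms" % (name, (time.perf_counter() - start) * 1000))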

Hope this helps someone.

@wagmiwiz (Contributor) commented:

This is really good information, kudos! I suspect it is probably something to do with this dependency loading, since I noticed our cold starts are slower for Lambdas where we have more dependencies/larger deploy files (a factor of 2 to 3 from an API that mostly has django, django rest framework and psycopg2 to one that also has numpy). For now we are using the brute force of a higher-CPU Lambda to bring it down to a few seconds, so it doesn't affect our users as much, but it's obviously not a long-term solution.

@wagmiwiz (Contributor) commented:

P.S. How did you profile the Lambda to put a number on those two lines?

@jleclanche commented:

I just added some timing calls in Zappa and deployed that instead.
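
Roughly this shape (a sketch, not the actual patch; the signature of get_django_wsgi is from memory):

# Sketch of the timing instrumentation (not the real Zappa code): wrap the two
# expensive steps in zappa.ext.django_zappa.get_django_wsgi and log the results
# so they show up in CloudWatch.
import logging
import os
import time

logger = logging.getLogger(__name__)


def get_django_wsgi(settings_module):
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)

    start = time.perf_counter()
    from django.core.wsgi import get_wsgi_application
    logger.warning("import get_wsgi_application: %.0f ms", (time.perf_counter() - start) * 1000)

    start = time.perf_counter()
    application = get_wsgi_application()
    logger.warning("get_wsgi_application(): %.0f ms", (time.perf_counter() - start) * 1000)
    return application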

@jakul (Contributor) commented Oct 25, 2018

I've done some work recently to reduce application startup times on cold start, and came up with the following conclusions. They might be specific to my application, but I'm posting them in case they can guide someone else:

  • cProfile traces really help to understand what's going on. I used iopipe to set up the tracing and to get the traces out of the Lambda, but you could also just implement those bits yourself (a minimal sketch follows this list). https://github.com/iopipe/iopipe-python/blob/3dce171ac1981ea89c08bbdd086fa1bb7e677204/iopipe/contrib/profiler/plugin.py shows how they do it
  • I used snakeviz and gprof2dot to help visualise the traces
  • The biggest causes of slow startup for my application were:
    • validation of a Swagger spec by Connexion
    • uploading .py files instead of .pyc files (Python takes more time to import .py files)
    • use of pkg_resources library
  • I also noticed that the bigger the Lambda, the quicker it can actually read files from the file system (which is done as part of the module import): on 128MB the get_data call took 2.5s, on 512MB it took 0.57s, and on 1024MB it took 0.32s
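
A minimal version of the cProfile setup mentioned in the first bullet, without iopipe (the module and handler names are hypothetical):

# Profile everything imported at module load (i.e. the cold start), dump the
# stats to /tmp on the first invocation, and print a summary to CloudWatch.
import cProfile
import pstats

_profiler = cProfile.Profile()
_profiler.enable()

from myapi.wsgi_handler import real_handler  # hypothetical: the expensive imports live here

_profiler.disable()


def handler(event, context):
    global _profiler
    if _profiler is not None:
        _profiler.dump_stats("/tmp/coldstart.prof")
        pstats.Stats("/tmp/coldstart.prof").sort_stats("cumulative").print_stats(20)
        _profiler = None
    return real_handler(event, context)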

@jleclanche commented:

> use of pkg_resources library

Can you elaborate on that? I usually use __version__ = pkg_resources.require(pkgname)[0].version to set my libraries' version, and I'm suddenly realizing I never really looked at how expensive that call is.
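
For what it's worth, a quick way to put a number on it locally (the package name is just an example); importing pkg_resources builds a working set of every installed distribution, which is the part that hurts on a cold start:

# Time the pkg_resources import and the require() call separately.
import time

start = time.perf_counter()
import pkg_resources
import_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
version = pkg_resources.require("requests")[0].version  # example package
require_ms = (time.perf_counter() - start) * 1000

print("import pkg_resources: %.0f ms, require(): %.0f ms, version=%s" % (import_ms, require_ms, version))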

@jleclanche commented:

> I also noticed that the bigger the lambda the quicker it can actually read files from the file system (which is done as part of the module import) - on 128MB the get_data took 2.5 seconds, on 512MB it took 0.57s and on 1024MB it took 0.32s

This is because you're bumping up the CPU of the Lambda every time you increase its RAM: Lambda allocates CPU power in proportion to the memory setting, so anything CPU-bound (like these imports) speeds up roughly linearly with it.
