Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trigger Alert CoreDNSErrorsHigh with OpenShift 4.11.9 #1388

Closed
rockwyc992 opened this issue Feb 8, 2023 · 3 comments
Closed

Trigger Alert CoreDNSErrorsHigh with OpenShift 4.11.9 #1388

rockwyc992 opened this issue Feb 8, 2023 · 3 comments

Comments

@rockwyc992
Copy link

Issue Description

I have deployed 3scale apimanager with 3scale operator 2.12. When these are huge number of connections, OpenShift would trigger Alert CoreDNSErrorsHigh .

There was also huge number of error logs, known as SERVFAIL, with CoreDNS, like below:

[INFO] 10.20.10.100:42782 - 766 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000114929s
[INFO] 10.20.10.100:32785 - 54091 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000045799s
[INFO] 10.20.10.100:38736 - 38959 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000495619s
[INFO] 10.20.10.100:53221 - 55209 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000111221s
[INFO] 10.20.10.100:43456 - 55627 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000087746s
[INFO] 10.20.10.100:53734 - 57461 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000093702s
[INFO] 10.20.10.100:56550 - 34310 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000131092s
[INFO] 10.20.10.100:40476 - 52329 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000077204s
[INFO] 10.20.10.100:48084 - 49744 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000089607s
[INFO] 10.20.10.100:43484 - 50657 "A IN backend-listener. udp 34 false 512" SERVFAIL qr,aa,rd,ra 34 0.000053069s

Version

OpenResty Version

sh-4.4$ resty -V    
resty 0.27
nginx version: openresty/1.19.3.1
built with OpenSSL 1.1.1k  FIPS 25 Mar 2021
TLS SNI support enabled
configure arguments: --prefix=/usr/local/openresty/nginx --with-cc-opt='-O2 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -fexceptions -fstack-protector-strong -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -DNGX_LUA_ABORT_AT_PANIC -I/usr/local/openresty/zlib/include -I/usr/local/openresty/pcre/include -I/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4/' --add-module=../ngx_devel_kit-0.3.1 --add-module=../echo-nginx-module-0.62 --add-module=../xss-nginx-module-0.06 --add-module=../ngx_coolkit-0.2 --add-module=../set-misc-nginx-module-0.32 --add-module=../form-input-nginx-module-0.12 --add-module=../encrypted-session-nginx-module-0.08 --add-module=../srcache-nginx-module-0.32 --add-module=../ngx_lua-0.10.19 --add-module=../ngx_lua_upstream-0.07 --add-module=../headers-more-nginx-module-0.33 --add-module=../array-var-nginx-module-0.05 --add-module=../memc-nginx-module-0.19 --add-module=../redis2-nginx-module-0.15 --add-module=../redis-nginx-module-0.3.7 --add-module=../ngx_stream_lua-0.0.9 --with-ld-opt='-Wl,-rpath,/usr/local/openresty/luajit/lib -Wl,-z,relro -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -L/usr/local/openresty/zlib/lib -L/usr/local/openresty/pcre/lib -Wl,-rpath,/usr/local/openresty/zlib/lib:/usr/local/openresty/pcre/lib -L/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4/' --with-pcre-jit --with-stream --with-stream_ssl_module --with-stream_ssl_preread_module --with-http_v2_module --without-mail_pop3_module --without-mail_imap_module --without-mail_smtp_module --with-http_stub_status_module --with-http_realip_module --with-http_addition_module --with-http_auth_request_module --with-http_secure_link_module --with-http_random_index_module --with-http_gzip_static_module --with-http_sub_module --with-http_dav_module --with-http_flv_module --with-http_mp4_module --with-http_gunzip_module --with-threads --add-module=/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../apicast-nginx-module-v0.4 --add-dynamic-module=/builddir/build/BUILD/openresty-72c2c44c0a02d07851998457c3367a72cdbad0ed/openresty-1.19.3.1/../nginx-opentracing-v0.3.0/opentracing --with-stream --with-http_ssl_module

API Cast Gateway Image Version

[user@localhost ~]$ oc image info registry.redhat.io/3scale-amp2/apicast-gateway-rhel8@sha256:e6970547e76e9671a11e45ebe55c1ccf8d3833e22ce232adb2fe963b0f0263cb
Name:          registry.redhat.io/3scale-amp2/apicast-gateway-rhel8@sha256:e6970547e76e9671a11e45ebe55c1ccf8d3833e22ce232adb2fe963b0f0263cb
Media Type:    application/vnd.docker.distribution.manifest.v2+json
Created:       147d ago
Image Size:    114.5MB in 3 layers
Layers:        79.43MB sha256:1b3417e31a5e0e64f861e121d4efed3152e75aaa85026cd784cd0070e063daa3
               1.796kB sha256:809fe483e88523e7021d76b001a552856f216430023bdc0aeff8fce8df385535
               35.03MB sha256:949460b6bc6cc93018087e3896c839f232ec361c13b612f5e5dab922f060d3bf
OS:            linux
Arch:          amd64
Entrypoint:    container-entrypoint
Command:       scripts/run
Working Dir:   /opt/app-root
User:          1001
Exposes Ports: 8080/tcp
Environment:   PATH=/opt/app-root/src/bin:/opt/app-root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
               container=oci
               AUTO_UPDATE_INTERVAL=0
               HOME=/opt/app-root/src
               BUILD_PATH=/remote-source/app
               LD_LIBRARY_PATH=:/opt/app-root/lib
               LUA_CPATH=./?.so;/usr/lib64/lua/5.1/?.so;/usr/lib64/lua/5.1/loadall.so;/usr/local/lib64/lua/5.1/?.so
               LUA_PATH=/usr/lib64/lua/5.1/?.lua;/usr/local/share/lua/5.1/?.lua;/usr/local/share/lua/5.1/*/?.lua;
Labels:        architecture=x86_64
               build-date=2022-09-13T21:46:03.758871
               com.redhat.build-host=cpt-1005.osbs.prod.upshift.rdu2.redhat.com
               com.redhat.component=3scale-amp-apicast-gateway-container
               com.redhat.license_terms=https://www.redhat.com/agreements
               description=APIcast is not a standalone API gateway therefore it needs connection to the 3scale API management platform. The container includes OpenResty and uses LuaRocks to install dependencies (rocks are installed in the application folder).
               distribution-scope=public
               io.k8s.description=APIcast is not a standalone API gateway therefore it needs connection to the 3scale API management platform. The container includes OpenResty and uses LuaRocks to install dependencies (rocks are installed in the application folder).
               io.k8s.display-name=3scale API gateway (APIcast)
               io.openshift.expose-services=8080:apicast
               io.openshift.tags=integration, nginx, lua, openresty, api, gateway, 3scale, rhamp
               [email protected]
               name=3scale-amp2/apicast-gateway-rhel8
               release=74
               summary=3scale's API gateway (APIcast) is an OpenResty application which consists of two parts: Nginx configuration and Lua files.
               url=https://access.redhat.com/containers/#/registry.access.redhat.com/3scale-amp2/apicast-gateway-rhel8/images/1.21.0-74
               vcs-ref=3e532f5dc893e3ed1d3d374d3f266e9704b358b0
               vcs-type=git
               vendor=Red Hat, Inc.
               version=1.21.0

Steps to Reproduce

  1. Change Environment APICAST_LOG_LEVEL to debug
  2. Watch out the logs of API Cast, there would show DNS resolve procedure, like below:

First time to request:

2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:350: lookup(): resolver query: backend-listener, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:188: get(): resolver cache miss: backend-listener., requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:122: fetch_answers(): resolver cache miss backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:188: get(): resolver cache miss: backend-listener.3scale.svc.cluster.local, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 dns_client.lua:75: query(): resolver query: backend-listener.3scale.svc.cluster.local nameserver: 10.10.0.10:53, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 cache.lua:74: store(): resolver cache write backend-listener.3scale.svc.cluster.local with TLL 5, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:410: get_servers(): query for backend-listener finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:350: lookup(): resolver query: 10.10.10.100, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:355: lookup(): host is ip address: 10.10.10.100, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:370: lookup(): resolver query: 10.10.10.100 finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 resolver.lua:410: get_servers(): query for 10.10.10.100 finished with 1 answers, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=A
2023/02/08 08:43:02 [debug] 22#22: *5 proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 0, requestID=A

Second time to request:

2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:350: lookup(): resolver query: backend-listener, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:188: get(): resolver cache miss: backend-listener., requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:115: fetch_answers(): resolver cache read backend-listener.3scale.svc.cluster.local 1 entries, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 cache.lua:190: get(): resolver cache hit: backend-listener.3scale.svc.cluster.local 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:410: get_servers(): query for backend-listener finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:350: lookup(): resolver query: 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:355: lookup(): host is ip address: 10.10.10.100, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:370: lookup(): resolver query: 10.10.10.100 finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 resolver.lua:410: get_servers(): query for 10.10.10.100 finished with 1 answers, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=B
2023/02/08 08:43:15 [debug] 22#22: *15 proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 2, requestID=B

Current Result

Current DNS resolve produce for API Cast, like below:

First Time to query backend-listener

  1. Query backend-listener
resolver.lua:350: lookup(): resolver query: backend-listener, requestID=A
  1. Query backend-listener., and got SERVFAIL for backend-listener.
resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=A
cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=A
dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=A
  1. Query backend-listener.3scale.svc.cluster.local. Found 10.10.10.100 and write cache to backend-listener.3scale.svc.cluster.local., not backend-listener
resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=A
cache.lua:122: fetch_answers(): resolver cache miss backend-listener.3scale.svc.cluster.local, requestID=A
dns_client.lua:75: query(): resolver query: backend-listener.3scale.svc.cluster.local nameserver: 10.10.0.10:53, requestID=A
cache.lua:74: store(): resolver cache write backend-listener.3scale.svc.cluster.local with TLL 5, requestID=A
  1. Found and connction established
resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=A
http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=A
proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 0, requestID=A

Second Time to query backend-listener

  1. Query backend-listener. I think cache shold be hit at this step.
resolver.lua:350: lookup(): resolver query: backend-listener, requestID=B
  1. Query backend-listener., still Got SERVFAIL for backend-listener.
resolver.lua:321: search_dns(): resolver query: backend-listener search: query: backend-listener., requestID=B
cache.lua:122: fetch_answers(): resolver cache miss backend-listener., requestID=B
dns_client.lua:75: query(): resolver query: backend-listener. nameserver: 10.10.0.10:53, requestID=B
  1. Query backend-listener.3scale.svc.cluster.local., cache hits for backend-listener.3scale.svc.cluster.local.
resolver.lua:321: search_dns(): resolver query: backend-listener search: 3scale.svc.cluster.local query: backend-listener.3scale.svc.cluster.local, requestID=B
cache.lua:115: fetch_answers(): resolver cache read backend-listener.3scale.svc.cluster.local 1 entries, requestID=B
  1. Found and connction established
resolver.lua:370: lookup(): resolver query: backend-listener finished with 1 answers, requestID=B
http.lua:50: connect(): connected to ip:10.10.10.100 host: 10.10.10.100 port: 3000 ok: 1 err: nil, requestID=B
proxy.lua:26: new(): connection to backend-listener:3000 established, reused times: 2, requestID=B

Expected Result

Second Time to query backend-listener, the cache should be hit at first step to avoid querying CoreDNS.

@rockwyc992
Copy link
Author

Maybe DNS cache save and check, should be move out.

https://github.com/3scale/APIcast/blame/master/gateway/src/resty/resolver.lua#L306

@eguzki
Copy link
Member

eguzki commented Feb 13, 2023

Thanks for looking at this and for the great description and detailed doc.

I looked at it and indeed it seems an issue.

We are currently working on other tasks with higher priorities. But I have created a Jira issue for this issue. https://issues.redhat.com/browse/THREESCALE-9301 so it can be added to the backlog for future sprints

@tkan145
Copy link
Contributor

tkan145 commented Nov 11, 2024

Fixed in #1500

@tkan145 tkan145 closed this as completed Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants