Type: Bug
Resolution: Done
Priority: Medium
Affects Version/s: Casablanca Release, Dublin Release
Component/s: None
Fix Version/s: El Alto Preliminary
In this environment, the symptom was that a GET to an AAI-Resources URL started returning a 403 Forbidden "Access Denied" error where it had previously worked as expected. In fact, none of the AAI calls were working at all.
curl --verbose --silent --insecure --user AAI:AAI --header 'Accept: application/json' --header 'X-TransactionId: testaai' --header 'Content-Type: application/json' --header 'X-FromAppId: AAI' https://${AAI_IP}:30233/aai/v14/network/collections
< HTTP/1.1 403 Forbidden
< Connection: close
< Date: Tue, 04 Jun 2019 07:17:22 GMT
< Content-Type: application/json;charset=utf-8
< Strict-Transport-Security: max-age=16000000; includeSubDomains; preload;
<
{ [data not shown]
* Closing connection 0
{
  "timestamp": 1559632642839,
  "status": 403,
  "error": "Forbidden",
  "message": "Access Denied",
  "path": "/aai/v14/network/collections"
}
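A quick way to see whether the problem is limited to the resources API or affects the whole AAI front end is to call the echo endpoint used by the haproxy readiness probe (path taken from the pod description further below). This is only a sketch: it assumes NodePort 30233 maps to haproxy's 8443 port.

# Echo endpoint used by the readiness probe; headers and credentials mirror the failing call above.
# Assumption: NodePort 30233 fronts the haproxy 8443 listener.
curl --verbose --insecure \
     --user AAI:AAI \
     --header 'X-FromAppId: AAI' \
     --header 'X-TransactionId: testaai' \
     "https://${AAI_IP}:30233/aai/util/echo"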
Further investigation showed that the healthchecks were passing for both AAF and AAI, but there were errors in the logs for the following pods (see the collection sketch after the list):
- dev-aaf-aaf-sms
- dev-aaf-aaf-sms-quorumclient-0
- dev-aaf-aaf-sms-quorumclient-1
- dev-aaf-aaf-sms-quorumclient-2
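The logs for those pods were pulled with kubectl logs. A minimal collection loop might look like the following; the pod names are copied from the list above (the sms pod may carry a replica-set hash suffix in practice), and the output paths are just an illustrative choice.

for p in dev-aaf-aaf-sms \
         dev-aaf-aaf-sms-quorumclient-0 \
         dev-aaf-aaf-sms-quorumclient-1 \
         dev-aaf-aaf-sms-quorumclient-2; do
  # --timestamps makes it easy to see whether anything has been logged recently
  kubectl -n onap logs --timestamps "$p" > "/tmp/${p}.log"
done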
The logs show errors for an earlier date, but no logs at all for recent dates (it is not clear whether this means it is currently working OK or not).
onap-hk@kubectl:~$ kubectl -n onap logs dev-aaf-aaf-sms-quorumclient-0 | less
INFO: 2019/02/25 11:15:00 quorumclient.go:77: Starting Log for Quorum Client
ERROR: 2019/02/25 11:15:00 auth.go:226: Read from file: open auth/dev-aaf-aaf-sms-quorumclient-0/id: no such file or directory
ERROR: 2019/02/25 11:15:00 quorumclient.go:86: Read ID: open auth/dev-aaf-aaf-sms-quorumclient-0/id: no such file or directory
WARNING: 2019/02/25 11:15:00 quorumclient.go:87: Unable to find an ID for this client. Generating...
ERROR: 2019/02/25 11:15:00 auth.go:226: Read from file: open auth/dev-aaf-aaf-sms-quorumclient-0/shard: no such file or directory
ERROR: 2019/02/25 11:15:00 quorumclient.go:99: Read Shard: open auth/dev-aaf-aaf-sms-quorumclient-0/shard: no such file or directory
WARNING: 2019/02/25 11:15:00 quorumclient.go:100: Unable to find a shard file. Registering with SMS...
ERROR: 2019/02/25 11:15:00 auth.go:226: Read from file: open auth/dev-aaf-aaf-sms-quorumclient-0/prkey: no such file or directory
ERROR: 2019/02/25 11:15:00 quorumclient.go:40: LoadPGP Private Key: open auth/dev-aaf-aaf-sms-quorumclient-0/prkey: no such file or directory
INFO: 2019/02/25 11:15:00 quorumclient.go:41: No Private Key found. Generating...
ERROR: 2019/02/25 11:17:18 quorumclient.go:176: Connect to SMS: Get https://aaf-sms.onap:10443/v1/sms/quorum/status: dial tcp 10.247.37.16:10443: connect: connection timed out
ERROR: 2019/02/25 11:19:25 quorumclient.go:176: Connect to SMS: Get https://aaf-sms.onap:10443/v1/sms/quorum/status: dial tcp 10.247.37.16:10443: connect: connection timed out
ERROR: 2019/02/25 11:21:32 quorumclient.go:176: Connect to SMS: Get https://aaf-sms.onap:10443/v1/sms/quorum/status: dial tcp 10.247.37.16:10443: connect: connection timed out
ERROR: 2019/02/25 11:23:39 quorumclient.go:176: Connect to SMS: Get https://aaf-sms.onap:10443/v1/sms/quorum/status: dial tcp 10.247.37.16:10443: connect: connection timed out
ERROR: 2019/02/25 11:25:47 quorumclient.go:176: Connect to SMS: Get https://aaf-sms.onap:10443/v1/sms/quorum/status: dial tcp 10.247.37.16:10443: connect: connection timed out
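The repeated "connection timed out" errors suggest the quorum clients cannot reach aaf-sms at all. One way to confirm, sketched here with a throwaway curl pod (curlimages/curl is just an assumption; any image with curl would do), is to call the same status URL from inside the cluster:

# Run a short-lived pod in the onap namespace and hit the quorum status URL
# that the quorum clients are timing out on.
kubectl -n onap run sms-check --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -vk https://aaf-sms.onap:10443/v1/sms/quorum/status

# If this also times out, check whether the aaf-sms service has any endpoints:
kubectl -n onap get svc,endpoints aaf-sms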
The logs for AAI-Resources seemed to show a problem with aaf-locate.
onap-hk@kubectl:~$ kubectl -n onap logs dev-aai-aai-resources-7c8b66fbf4-pkplj aai-resources | less
2019-06-04T07:01:54.453+0000: Error connecting https://aaf-locate.onap:8095/locate/AAF_NS.service:2.0 for location.
org.onap.aaf.misc.env.APIException: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
        at org.onap.aaf.cadi.http.HClient.send(HClient.java:153)
        at org.onap.aaf.cadi.aaf.v2_0.AAFLocator.refresh(AAFLocator.java:109)
2019-06-04T07:01:54.454+0000 WARN [cadi] Authenticating Service unavailable from 172.16.0.72:35486 ( No Entries found for 'https://aaf-locate.onap:8095/locate/AAF_NS.service:2.0' )
2019-06-04T07:01:54.454+0000 AUDIT [cadi] No TAF will authorize for request from 172.16.0.72:35486
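The certificate_unknown alert means the TLS handshake with aaf-locate failed, which would fit an expired or untrusted certificate. A quick way to inspect the certificate aaf-locate is actually serving, assuming the service is exposed as aaf-locate on port 8095 (as the URL in the log implies), is to port-forward it locally and read the validity dates:

# Forward the aaf-locate service to the local machine...
kubectl -n onap port-forward svc/aaf-locate 8095:8095 &

# ...then dump the served certificate's subject, issuer and validity period.
echo | openssl s_client -connect localhost:8095 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates

kill %1   # stop the port-forward again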
There was a message on onap-discuss about certificates expiring on 31st May, but it is not clear whether that is also the case here:
https://lists.onap.org/g/onap-discuss/topic/31911771
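If the expiry discussed in that thread is the issue here, the certificates served by the AAI/AAF endpoints should show a notAfter of 31 May 2019 or earlier. A minimal check against the externally exposed AAI endpoint used in the failing curl above:

# Show the expiry of the certificate presented on the AAI NodePort.
echo | openssl s_client -connect ${AAI_IP}:30233 2>/dev/null \
  | openssl x509 -noout -enddate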
The pod and container for aaf-locate were still running.
onap-hk@kubectl:~$ kubectl -n onap get pods | grep aaf-locate
dev-aaf-aaf-locate-55cf75bbc8-4m7gd   1/1   Running   0   99d
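A running pod does not guarantee the service in front of it is wired up, so it is also worth confirming that the aaf-locate Service still lists that pod as an endpoint (the service name aaf-locate is assumed from the DNS name in the resources log):

kubectl -n onap get svc aaf-locate -o wide
kubectl -n onap get endpoints aaf-locate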
The aaf-locate logs only show the same registration endpoint call repeated every few minutes:
onap-hk@kubectl:~$ kubectl -n onap logs dev-aaf-aaf-locate-55cf75bbc8-4m7gd | less
2019-06-04 07:05:10,385 WARN [service] 2019-06-04T07:05:10.385+0000 WARN [service] user=aaf@aaf.osaaf.org[x509],ip=172.16.0.49,port=59432,ms=0.311474,meth=PUT,path=/registration,msg="Put Mgmt Endpoints"
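Since aaf-locate is logging only its own registration PUTs and no incoming /locate lookups, it may help to call the lookup URL from the resources log directly, for example by reusing the throwaway curl pod from the earlier sketch. Passing -k skips certificate verification, which separates "endpoint not answering" from "certificate not trusted":

kubectl -n onap run locate-check --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -vk https://aaf-locate.onap:8095/locate/AAF_NS.service:2.0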
Describing the aai haproxy pod shows that kubelet has recorded several readiness-probe failures:
$ kubectl -n onap describe pod dev-aai-aai-5f48cc456-pwscq
Name:               dev-aai-aai-5f48cc456-pwscq
Namespace:          onap
Priority:           0
PriorityClassName:  <none>
Node:               192.168.1.200/192.168.1.200
Start Time:         Mon, 25 Feb 2019 11:13:42 +0800
Labels:             app=aai
                    pod-template-hash=190477012
                    release=dev-aai
Annotations:        checksum/config=edc35b32bd1b7973d32b6ed47481fe22c7596a491b213ea093f0bf9021c7daf6
                    kubernetes.io/availablezone=ap-southeast-1b
Status:             Running
IP:                 172.16.0.72
Controlled By:      ReplicaSet/dev-aai-aai-5f48cc456
Init Containers:
  aai-readiness:
    Container ID:  docker://ba7d49090e2db4172d07c3348d5ddc4e82c5060a29bcce590db2e812dcf2e041
    Image:         oomk8s/readiness-check:2.0.0
    Image ID:      docker-pullable://oomk8s/readiness-check@sha256:7daa08b81954360a1111d03364febcb3dcfeb723bcc12ce3eb3ed3e53f2323ed
    Port:          <none>
    Host Port:     <none>
    Command:
      /root/ready.py
    Args:
      --container-name
      aai-resources
      --container-name
      aai-traversal
      --container-name
      aai-graphadmin
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 25 Feb 2019 11:47:48 +0800
      Finished:     Mon, 25 Feb 2019 11:53:15 +0800
    Ready:          True
    Restart Count:  3
    Environment:
      NAMESPACE:  onap (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zgjkm (ro)
Containers:
  aai:
    Container ID:   docker://2d5a500a10c133388e842755782d2001adf3096bf3b22c491fb566f8edf5c92c
    Image:          docker.io/aaionap/haproxy:1.2.4
    Image ID:       docker-pullable://aaionap/haproxy@sha256:002d2cc8519fdeb082943f9bd5cf5c4bd4f489199e6c3e9697fda46259522d3e
    Ports:          8080/TCP, 8443/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Running
      Started:      Mon, 25 Feb 2019 11:54:01 +0800
    Ready:          True
    Restart Count:  0
    Liveness:       tcp-socket :8443 delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get https://:8443/aai/util/echo delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:    <none>
    Mounts:
      /dev/log from aai-service-log (rw)
      /etc/localtime from localtime (ro)
      /usr/local/etc/haproxy/haproxy.cfg from haproxy-cfg (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zgjkm (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
  aai-service-log:
    Type:          HostPath (bare host directory volume)
    Path:          /dev/log
    HostPathType:
  haproxy-cfg:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      aai-deployment-configmap
    Optional:  false
  default-token-zgjkm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zgjkm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                 From                    Message
  ----     ------     ----                ----                    -------
  Warning  Unhealthy  2m (x5 over 100d)   kubelet, 192.168.1.200  Readiness probe failed: Get https://172.16.0.72:8443/aai/util/echo: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  21s (x19 over 99d)  kubelet, 192.168.1.200  Readiness probe failed: HTTP probe failed with statuscode: 503
  Normal   Healthy    11s (x8 over 100d)  kubelet, 192.168.1.200  container docker://2d5a500a10c133388e842755782d2001adf3096bf3b22c491fb566f8edf5c92c in health status
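The failing readiness probe can also be reproduced by hand from inside the haproxy container. This assumes the aaionap/haproxy image ships curl; if it does not, the same request can be made from a throwaway pod as in the earlier sketch.

# Call the readiness endpoint from inside the "aai" (haproxy) container and print only the HTTP status code.
kubectl -n onap exec dev-aai-aai-5f48cc456-pwscq -c aai -- \
    curl -sk -o /dev/null -w '%{http_code}\n' https://localhost:8443/aai/util/echo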
The system is definitely not healthy, but Kubernetes is not doing anything to recover it, i.e. no pods are being restarted, and the externally visible status of the pods/containers falsely claims that everything is running fine.
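Part of the reason Kubernetes never restarts anything is visible in the probe definitions above: the liveness probe is only a TCP connect on 8443, which haproxy keeps answering, so only the readiness probe (the HTTP echo check) ever fails, and a failing readiness probe only removes the pod from the Service, it never triggers a restart. The configured probes can be printed straight from the pod spec:

# Print the liveness and readiness probe definitions for the haproxy container.
kubectl -n onap get pod dev-aai-aai-5f48cc456-pwscq \
    -o jsonpath='{.spec.containers[0].livenessProbe}{"\n"}{.spec.containers[0].readinessProbe}{"\n"}'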
relates to
- AAI-2082 aai-resources gives incorrect output when aai-cassandra has shutdown with failure (Closed)
- AAI-2282 Update certificate for Casablanca 3.0.2 (Closed)
- AAI-2312 AAI Resilience OOM pod deployment dependencies (Closed)
- AAI-2340 AAI Resilience deeper health checks (Closed)