How to determine the cause of an AKS kubernetes cluster failure
I have a production AKS Kubernetes cluster hosted in uk-south that has become unstable and unresponsive:
From the image, you can see that I have several pods in varying states of unreadiness, i.e. Terminating/Unknown, and the ones that report to be running are inaccessible.
I can see from the Insights grid that the issue started at around 9:50 pm last night.
I've scoured the logs in the AKS service itself and the Kibana logs for the apps running on the cluster around the time of the failure, but I am struggling to see anything that looks to have caused this.
Luckily I have two clusters serving production under a traffic manager, so I have routed all traffic to the healthy one. My worry is that I need to understand what caused this, especially if the same happens on the other cluster, as there would be production downtime while I spin up a new one.
My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?
azure kubernetes azure-kubernetes azure-aks
what about platform level? this could have been a platform level issue
– 4c74356b41
Jan 18 at 17:40
Do the logs show anything? Have you tried seeing your pod/container logs?
– Rico
Jan 18 at 17:50
Have you enabled log collection? If you have, the kubernetes-api-server logs are written to a storage blob or aggregated in a Log Analytics instance.
– Aditya Sundaramurthy
Jan 18 at 20:01
Also, AKS recently switched from Docker community to Moby. We've been having problems with our clusters ever since they switched, particularly with the Docker daemon process becoming unresponsive.
– Aditya Sundaramurthy
Jan 18 at 20:03
From your description and the photo, I think there are two possible reasons: 1. the cluster does not have enough resources, 2. there is an error in your image, application, or configuration. You can check a pod or service with the command
kubectl describe pod/service podName/serviceName
– Charles Xu
Jan 19 at 1:44
edited Jan 18 at 17:49 by Rico
asked Jan 18 at 17:39 by Declan McNulty
1 Answer
I would suggest examining the K8s event log around the time your nodes went "not ready".
Try opening the "Insights" Nodes tab and choosing a timeframe (at the top) around the time things went wrong. See what the node statuses are. Any pressure conditions? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that the link on the node's property panel constructs a complicated query that shows only events tagged with that node.
You can get this information with simpler queries (and run more fun queries as well) in Logs. Open the "Logs" tab in the left menu on the cluster and execute a query similar to this one (change the time interval to the one you need):
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
See if you have events indicating what went wrong. It may also be worth looking at the node inventory on your cluster. Nodes report their K8s status. It was "Ready" prior to the problem; then something went wrong, so what is the status now? Out of disk, by chance?
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
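If the raw inventory rows are too noisy to read, a summarized variant along these lines may make the status change easier to spot. This is only a sketch, not something I have run against your workspace: it assumes the KubeNodeInventory table exposes `Computer` (node name) and `Status` columns, and the `FirstSeen`/`LastSeen` names are just labels chosen for illustration.

```kusto
// Sketch: for each node, list every distinct status it reported in the window,
// with the first and last time that status was seen.
// Assumes KubeNodeInventory has Computer and Status columns.
let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| summarize FirstSeen = min(TimeGenerated), LastSeen = max(TimeGenerated) by Computer, Status
| order by Computer asc, FirstSeen asc
```

A node that was healthy throughout should show a single row; a node with more than one row changed status during the window, and the `FirstSeen` value of the later row gives a rough time to line up against the events query.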
answered Jan 18 at 20:01 by Vitaly