How to determine the cause of an AKS Kubernetes cluster failure

I have a production AKS Kubernetes cluster, hosted in uk-south, that has become unstable and unresponsive.

From the image, you can see that several pods are in varying states of unreadiness (i.e. Terminating or Unknown), and the ones that report as running are inaccessible.

I can see from the Insights grid that the issue started at around 9:50pm last night.

I've scoured the logs in the AKS service itself, and the Kibana logs for the apps running on the cluster, around the time of the failure, but I am struggling to see anything that looks like it caused this.

Luckily I have two clusters serving production behind a Traffic Manager, so I have routed all traffic to the healthy one. My worry is that I still need to understand what caused this: if the same thing happens on the other cluster, there will be production downtime while I spin up a new one.

My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?

asked Jan 18 at 17:39 by Declan McNulty
edited Jan 18 at 17:49 by Rico

What about the platform level? This could have been a platform-level issue.

– 4c74356b41
Jan 18 at 17:40

Do the logs show anything? Have you tried looking at your pod/container logs?

– Rico
Jan 18 at 17:50

Have you enabled log collection? If you have, the Kubernetes API server logs are written to a storage blob or aggregated in a Log Analytics instance.

– Aditya Sundaramurthy
Jan 18 at 20:01

Also, AKS recently switched from Docker Community to Moby. We've been having problems with our clusters ever since they switched, particularly with the Docker daemon process becoming unresponsive.

– Aditya Sundaramurthy
Jan 18 at 20:03

From your description and the screenshot, I think there are two possible causes: 1. the cluster does not have enough resources, or 2. there is an error in your image, application, or configuration. You can inspect a pod or service with kubectl describe pod <podName> or kubectl describe service <serviceName>.

– Charles Xu
Jan 19 at 1:44

azure kubernetes azure-kubernetes azure-aks

1 Answer

I would suggest examining the K8s event log around the time your nodes went "not ready".

Try opening the "Insights" Nodes tab and choosing a timeframe (at the top) around the time when things went wrong. See what the node statuses are. Any pressures? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that the link on the node's property panel constructs a complicated query that shows only events tagged with that node.

You can get this information with simpler queries (and run more fun queries as well) in Logs. Open the "Logs" tab in the left menu on the cluster and execute a query similar to this one (change the time interval to the one you need):

    let startDateTime = datetime('2019-01-01T13:45:00.000Z');
    let endDateTime = datetime('2019-01-02T13:45:00.000Z');
    KubeEvents_CL
    | where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
    | order by TimeGenerated desc
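If you need to rerun the query for several windows around the incident, a small helper can fill in the two datetime literals for you. This is only a sketch in Python: the function names, the 30-minute defaults, and the incident timestamp are illustrative assumptions, not part of any Azure tooling. It emits the same query shape as above, bracketed around a given UTC time.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def incident_window(incident_utc: datetime, before_min: int = 30,
                    after_min: int = 30) -> Tuple[str, str]:
    """Return (start, end) UTC timestamps bracketing the incident time."""
    if incident_utc.tzinfo is None:
        raise ValueError("incident_utc must be timezone-aware")
    incident_utc = incident_utc.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%S.000Z"  # same literal format as the query above
    start = incident_utc - timedelta(minutes=before_min)
    end = incident_utc + timedelta(minutes=after_min)
    return start.strftime(fmt), end.strftime(fmt)


def build_query(table: str, incident_utc: datetime, before_min: int = 30,
                after_min: int = 30) -> str:
    """Build the time-bounded query text for the given table name."""
    start, end = incident_window(incident_utc, before_min, after_min)
    return "\n".join([
        f"let startDateTime = datetime('{start}');",
        f"let endDateTime = datetime('{end}');",
        table,
        "| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime",
        "| order by TimeGenerated desc",
    ])


if __name__ == "__main__":
    # Placeholder incident time (assumption): replace with your own, in UTC.
    incident = datetime(2019, 1, 17, 21, 50, tzinfo=timezone.utc)
    print(build_query("KubeEvents_CL", incident))
```

Paste the printed text into the "Logs" tab. Passing a different table name reuses the same window for another table.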

See if you have events indicating what went wrong. Also of interest is the node inventory on your cluster. Nodes report their K8s status. It was "Ready" prior to the problem. Then something went wrong: what is the status now? Out of disk, by chance?

    let startDateTime = datetime('2019-01-01T13:45:00.000Z');
    let endDateTime = datetime('2019-01-02T13:45:00.000Z');
    KubeNodeInventory
    | where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
    | order by TimeGenerated desc
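The node inventory can return a lot of rows, so it may help to reduce it to just the moments when a node's status changed. Here is a rough Python sketch that does this over rows exported from the query above (for example as CSV). It assumes the export has TimeGenerated, Computer and Status columns and that all timestamps share one ISO-8601 UTC format. The column names other than TimeGenerated, and the sample data, are assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List, Tuple

# Invented sample rows, standing in for a CSV export of the inventory query.
SAMPLE = """TimeGenerated,Computer,Status
2019-01-17T21:45:00Z,aks-node-0,Ready
2019-01-17T21:45:00Z,aks-node-1,Ready
2019-01-17T21:50:00Z,aks-node-0,Ready
2019-01-17T21:50:00Z,aks-node-1,NotReady
2019-01-17T21:55:00Z,aks-node-0,Ready
2019-01-17T21:55:00Z,aks-node-1,NotReady
"""


def status_changes(rows: List[Dict[str, str]]) -> List[Tuple[str, str, str, str]]:
    """Return (time, node, old_status, new_status) for each status change.

    Rows are processed in ascending time order; timestamps in one consistent
    ISO-8601 UTC format sort correctly as plain strings.
    """
    last_seen: Dict[str, str] = {}
    changes: List[Tuple[str, str, str, str]] = []
    for row in sorted(rows, key=lambda r: r["TimeGenerated"]):
        node, status = row["Computer"], row["Status"]
        previous = last_seen.get(node)
        # Record only transitions, not the first sighting of a node.
        if previous is not None and previous != status:
            changes.append((row["TimeGenerated"], node, previous, status))
        last_seen[node] = status
    return changes


if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE)))
    for when, node, old, new in status_changes(rows):
        print(f"{when}  {node}: {old} -> {new}")
```

The earliest transition away from "Ready" gives you a timestamp to line up against the events returned by the first query.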

answered Jan 18 at 20:01 by Vitaly
