How to determine the cause of an AKS Kubernetes cluster failure



























I have a production AKS Kubernetes cluster, hosted in UK South, that has become unstable and unresponsive:



image 1



From the image, you can see that I have several pods in various unready states (i.e. Terminating/Unknown), and the ones that report as Running are inaccessible.



I can see from the Insights grid that the issue started at around 9:50pm last night:



image 2



I've scoured the logs in the AKS service itself and the Kibana logs for the apps running on the cluster around the time of the failure, but I am struggling to see anything that looks to have caused this.



Luckily I have two clusters serving production under a Traffic Manager, so I have routed all traffic to the healthy one. My worry is that I still need to understand what caused this, especially in case the same happens on the other cluster, as there would be production downtime while I spin up a new one.



My question is: am I missing any obvious places to look for information on what caused the issue? Are there any event logs that may point to what the problem is?



































  • What about the platform level? This could have been a platform-level issue.

    – 4c74356b41
    Jan 18 at 17:40











  • Do the logs show anything? Have you tried looking at your pod/container logs?

    – Rico
    Jan 18 at 17:50











  • Have you enabled log collection? If you have, the kubernetes-api-server logs are written to a storage blob or aggregated in a Log Analytics instance.

    – Aditya Sundaramurthy
    Jan 18 at 20:01











  • Also, AKS recently switched from Docker community to Moby. We've been having problems with our clusters ever since they switched, particularly with the Docker daemon process becoming unresponsive.

    – Aditya Sundaramurthy
    Jan 18 at 20:03











  • From your description and the screenshot, I think there are two possible causes: 1. the cluster does not have enough resources, or 2. there is an error in your image, application, or configuration. You can inspect a pod or service with kubectl describe pod <podName> or kubectl describe service <serviceName>.

    – Charles Xu
    Jan 19 at 1:44
















azure kubernetes azure-kubernetes azure-aks














edited Jan 18 at 17:49 by Rico

asked Jan 18 at 17:39 by Declan McNulty

























1 Answer














I would suggest examining the K8s event log around the time your nodes went "not ready".



Try opening the "Insights" Nodes tab and choosing a timeframe at the top around the time when things went wrong. Check the node statuses: is any node under pressure (memory or disk, for example)? You can see that in the property panel to the right of the node list. The property panel also contains a link to the event logs for that timeframe. Note, though, that this link constructs a fairly complicated query that shows only events tagged with that node.



You can get this information with simpler queries (and run other interesting queries as well) in Logs. Open the "Logs" tab in the left menu of the cluster and run a query similar to this one (change the time interval to the one you need):



let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeEvents_CL
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
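
If log collection was not enabled at the time, a similar view can sometimes be pulled from the API server directly, since `kubectl get events --all-namespaces -o json` returns whatever events are still retained (typically only a short recent window). The following is a minimal Python sketch, not part of the original answer, that keeps only Warning events inside a time window. The sample data, the node name and the function names are hypothetical and only illustrate the shape of the JSON.

```python
import json
from datetime import datetime, timezone


def parse_ts(value):
    # Kubernetes timestamps are RFC 3339 in UTC, e.g. "2019-01-17T21:50:03Z".
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def warning_events(events_json, start, end):
    """Return (time, kind/name, reason, message) for Warning events in [start, end)."""
    rows = []
    for item in json.loads(events_json).get("items", []):
        if item.get("type") != "Warning":
            continue
        stamp = item.get("lastTimestamp")
        if not stamp:
            continue  # events without lastTimestamp are skipped in this sketch
        when = parse_ts(stamp)
        if start <= when < end:
            obj = item.get("involvedObject", {})
            target = f"{obj.get('kind')}/{obj.get('name')}"
            rows.append((when, target, item.get("reason"), item.get("message")))
    # Newest first, mirroring "order by TimeGenerated desc" in the query above.
    return sorted(rows, key=lambda row: row[0], reverse=True)


# Hypothetical sample in the shape produced by `kubectl get events -o json`.
SAMPLE = json.dumps({"items": [
    {"type": "Normal", "reason": "Pulled", "message": "image pulled",
     "lastTimestamp": "2019-01-17T21:40:00Z",
     "involvedObject": {"kind": "Pod", "name": "web-1"}},
    {"type": "Warning", "reason": "NodeNotReady", "message": "Node is not ready",
     "lastTimestamp": "2019-01-17T21:50:03Z",
     "involvedObject": {"kind": "Node", "name": "aks-nodepool1-0"}},
    {"type": "Warning", "reason": "BackOff", "message": "Back-off restarting",
     "lastTimestamp": "2019-01-18T09:00:00Z",
     "involvedObject": {"kind": "Pod", "name": "web-2"}},
]})

window_start = datetime(2019, 1, 17, 21, 0, tzinfo=timezone.utc)
window_end = datetime(2019, 1, 17, 23, 0, tzinfo=timezone.utc)
for when, target, reason, message in warning_events(SAMPLE, window_start, window_end):
    print(when.isoformat(), target, reason, message)
```

In practice the JSON string would come from the kubectl command rather than from `SAMPLE`.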


See if you have events indicating what went wrong. It is also worth looking at the node inventory for your cluster. Each node reports its K8s status, which was "Ready" prior to the problem. What did the status change to when things went wrong? "Out of Disk", by chance?



let startDateTime = datetime('2019-01-01T13:45:00.000Z');
let endDateTime = datetime('2019-01-02T13:45:00.000Z');
KubeNodeInventory
| where TimeGenerated >= startDateTime and TimeGenerated < endDateTime
| order by TimeGenerated desc
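
The same node status check can be approximated outside Log Analytics from the output of `kubectl get nodes -o json`, which exposes each node's conditions under `status.conditions`. This is a hedged Python sketch, not the answer author's method; the node names and the function name are hypothetical, and the `OutOfDisk` condition is only reported by older Kubernetes versions.

```python
import json

# Condition types where status "True" signals trouble.
# "OutOfDisk" is only reported by older Kubernetes versions.
PRESSURE_TYPES = {
    "MemoryPressure", "DiskPressure", "PIDPressure",
    "NetworkUnavailable", "OutOfDisk",
}


def unhealthy_conditions(nodes_json):
    """Return (node, condition, status, reason, lastTransitionTime) for bad conditions."""
    problems = []
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        for cond in node.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            # "Ready" is healthy only when "True"; pressure types only when not "True".
            not_ready = ctype == "Ready" and status != "True"
            under_pressure = ctype in PRESSURE_TYPES and status == "True"
            if not_ready or under_pressure:
                problems.append((name, ctype, status, cond.get("reason"),
                                 cond.get("lastTransitionTime")))
    return problems


# Hypothetical sample in the shape produced by `kubectl get nodes -o json`.
SAMPLE_NODES = json.dumps({"items": [
    {"metadata": {"name": "node-a"},
     "status": {"conditions": [
         {"type": "Ready", "status": "Unknown", "reason": "NodeStatusUnknown",
          "lastTransitionTime": "2019-01-17T21:50:00Z"}]}},
    {"metadata": {"name": "node-b"},
     "status": {"conditions": [
         {"type": "DiskPressure", "status": "False", "reason": "KubeletHasNoDiskPressure",
          "lastTransitionTime": "2019-01-10T08:00:00Z"},
         {"type": "Ready", "status": "True", "reason": "KubeletReady",
          "lastTransitionTime": "2019-01-10T08:00:00Z"}]}},
]})

for problem in unhealthy_conditions(SAMPLE_NODES):
    print(problem)
```

The `lastTransitionTime` of a failing condition gives a timestamp to line up against the events and the Insights grid.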






































answered Jan 18 at 20:01 by Vitaly





























