Debugging production issues is always hectic, so it is wise to have a predefined set of steps to follow. There will still be scenarios where those steps do not work, and even then you should have a flow in mind for what to check. In this article, we will go through some points to keep in mind while debugging.
People generally develop hypotheses about what must be wrong. Don’t jump straight into checking them.
When everyone starts checking their own hypotheses, they make a lot of changes, and those become hard to keep track of. There should be an incident manager who decides what gets checked and properly notes down what was done.
First of all, check the system’s resources such as CPU, memory, and disk: how much of each is in use and whether any of them is under pressure.
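As a rough illustration of that first resource check, here is a minimal Python sketch using only the standard library. The function names, the 0.9 limit, and the choice of `/proc/meminfo` (Linux only) are assumptions made for this example, not something the steps above prescribe. In practice you would more likely reach for your monitoring stack or tools such as `top`, `free`, and `df`.

```python
import os
import shutil


def read_meminfo(path="/proc/meminfo"):
    """Return (total_kib, available_kib) from a Linux meminfo file, or None."""
    fields = {}
    try:
        with open(path) as fh:
            for line in fh:
                key, _, rest = line.partition(":")
                parts = rest.split()
                if parts and parts[0].isdigit():
                    fields[key.strip()] = int(parts[0])
    except OSError:
        return None  # not Linux, or the file is not readable
    total = fields.get("MemTotal")
    available = fields.get("MemAvailable")
    if not total or available is None:
        return None
    return total, available


def resource_snapshot(disk_path="/"):
    """Collect a coarse view of CPU, memory, and disk usage."""
    snapshot = {}

    # CPU: 1-minute load average per core (os.getloadavg is Unix only).
    cpus = os.cpu_count() or 1
    try:
        snapshot["cpu_load_per_core"] = os.getloadavg()[0] / cpus
    except (AttributeError, OSError):
        snapshot["cpu_load_per_core"] = None

    # Memory: fraction of RAM that is not available to new work.
    mem = read_meminfo()
    snapshot["mem_used_fraction"] = None if mem is None else 1 - mem[1] / mem[0]

    # Disk: fraction of the filesystem holding disk_path that is used.
    usage = shutil.disk_usage(disk_path)
    snapshot["disk_used_fraction"] = usage.used / usage.total if usage.total else None

    return snapshot


def under_pressure(snapshot, limit=0.9):
    """Names of resources at or above the (assumed) limit; unknown values are skipped."""
    return [name for name, value in snapshot.items()
            if value is not None and value >= limit]
```

Calling `under_pressure(resource_snapshot())` gives a quick list of resources worth looking at first. The single shared limit is a simplification; real thresholds differ per resource and per system.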
You will find people doing a restart as a solution to many problems. Mind it, a restart is not a solution. It is just a way to show you are lazy and would rather patch the system than debug it, living with the probability that the problem will happen again in the future.
Restarting is never a solution; you are just pushing D-day a bit further into the future. If you don’t fix the underlying problem, it can hurt you big time later.
Try to eliminate the possible causes you have in mind one by one. Mind it, one change at a time. If you make multiple changes at once, you may get stuck, because it will be unclear which change fixed the issue.
Making one change at a time and keeping a record of each will help you track down the problem and pinpoint the issue.
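One way to enforce "one change at a time, with a record" is a small change log that refuses a new change until the previous one has an outcome. The sketch below is only an illustration; `ChangeRecord`, `ChangeLog`, and the outcome labels are hypothetical names invented for this example, and a shared document or ticket can serve the same purpose.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    hypothesis: str            # what we think is wrong
    change: str                # the single change made to test it
    outcome: str = "pending"   # "pending", "fixed", or "no effect"
    reverted: bool = False
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class ChangeLog:
    """Keeps changes in order and allows only one unresolved change at a time."""

    def __init__(self):
        self.records = []

    def apply(self, hypothesis, change):
        if self.records and self.records[-1].outcome == "pending":
            raise RuntimeError("resolve the previous change before making another")
        record = ChangeRecord(hypothesis, change)
        self.records.append(record)
        return record

    def resolve(self, outcome, reverted=False):
        if not self.records:
            raise RuntimeError("no change has been recorded yet")
        record = self.records[-1]
        record.outcome = outcome
        record.reverted = reverted
        return record

    def fixes(self):
        """Changes that were observed to fix the issue."""
        return [r for r in self.records if r.outcome == "fixed"]
```

A typical sequence is `apply(...)`, observe the system, then `resolve("no effect", reverted=True)` or `resolve("fixed")` before the next `apply(...)`. Afterwards, `fixes()` shows exactly which change made the difference.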
Understanding the flow of the system you are managing will help you reach the problem faster.
It is always better to have more knowledge of the system. The more informed you are about it, the better the decisions you can make to mitigate any issue.
Provision your system in such a way that you can see its metrics and logs easily and search through them.
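Logs are much easier to search when every line is structured. As a minimal sketch, assuming your log pipeline can index JSON lines, Python's standard `logging` module can emit one JSON object per entry. The `JsonFormatter` class, the `build_logger` helper, and the `request_id` and `service` fields are hypothetical choices for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record):
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("request_id", "service"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

With this in place, `build_logger("checkout").info("payment failed", extra={"request_id": "r-1"})` produces a line you can filter by `request_id` or `level` instead of grepping free text.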
Having visibility into what’s happening is best. If you can figure out from your metrics where the issue is, or whether an issue is about to happen, you can predict when it will happen. Then you have done 50% of your DevOps right.
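To make "predicting from metrics" concrete, here is a simple sketch that fits a straight line through recent samples of a metric, such as disk usage percentage, and estimates how long until it crosses a threshold. A linear trend is an assumption made for illustration; real metrics are often bursty or seasonal, and `fit_line` and `time_until` are hypothetical helper names.

```python
def fit_line(samples):
    """Least-squares line through (time, value) samples; returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_until(samples, threshold):
    """Time after the last sample until the trend reaches `threshold`.

    Returns None when the metric is flat or falling, and 0.0 when the
    fitted trend has already reached the threshold.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - last_t)
```

For hourly disk samples `[(0, 50.0), (1, 55.0), (2, 60.0), (3, 65.0)]` and a threshold of 90, `time_until` returns `5.0`: at the current rate, the disk reaches 90% five hours after the last sample.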