How to identify that a Clear Linux system has hung

Is there any function or system-level variable in the kernel that indicates the system has hung?

Should I use a watchdog timer for this use case?

Consider what application or infrastructure you’re trying to protect, and adapt your monitoring to the operational realities you’ve seen. Different tools fit some needs better than others, so the right choice depends on more details about your setup.

For programs and daemons, systemd units provide some simple built-in ways to detect failed states and restart the service (for example the `Restart=` and `WatchdogSec=` settings).
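As a rough illustration of the systemd route, here is a minimal Python sketch of a daemon "petting" the service watchdog. It assumes the unit sets `WatchdogSec=` and a suitable `Restart=` policy, so that systemd exports `NOTIFY_SOCKET` and `WATCHDOG_USEC` to the process. The helper names `sd_notify` and `watchdog_interval_seconds` are my own for this example, not part of any library.

```python
import os
import socket


def sd_notify(state: str) -> bool:
    """Send a state string (e.g. "WATCHDOG=1") to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset, which usually means the
    process was not started by systemd with notifications enabled.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading '@' denotes a Linux abstract-namespace socket.
    if path.startswith("@"):
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), path)
    return True


def watchdog_interval_seconds(default: float = 10.0) -> float:
    """Return half of the WATCHDOG_USEC timeout, or a default if it is unset."""
    usec = os.environ.get("WATCHDOG_USEC")
    if not usec:
        return default
    return int(usec) / 1_000_000 / 2
```

The daemon's main loop would call `sd_notify("WATCHDOG=1")` roughly every `watchdog_interval_seconds()`, and only after confirming it is still doing useful work. If the keep-alives stop, systemd treats the service as hung and can restart it.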

For a broader system, self-monitoring will only be so effective. If a system is unresponsive, it may not be able to report that something is wrong. Monitoring from an external system usually detects these cases more reliably.

You can check the standard system health indicators to detect an issue (CPU load/wait, swapping, etc.), but in my experience it is better to focus on checking the actual service you care about. For example, if you’re monitoring an HTTP server, make sure not just that it is online but also that it is responding to HTTP traffic, and that the response is what you’re expecting.
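A service-level check like that could look something like the sketch below, run from a separate monitoring host. The function name `http_service_healthy`, the URL, and the expected text are placeholders for illustration; substitute whatever your service actually returns.

```python
import http.client
import urllib.error
import urllib.request


def http_service_healthy(url: str, expected_text: str, timeout: float = 5.0) -> bool:
    """Return True only if the URL answers 200 and the body contains expected_text."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, HTTP error status, etc.
        return False
```

The timeout matters here: a hung machine often still accepts TCP connections but never answers, so a check without a timeout can hang along with it.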
