From 8335edfa5fbbe344c79a7d1bc37cf173b74f8972 Mon Sep 17 00:00:00 2001
From: udh-mendix <47151110+udh-mendix@users.noreply.github.com>
Date: Wed, 13 Mar 2019 14:44:40 +0100
Subject: [PATCH 1/4] Updated alerts page

* Added documentation for missing alerts for v4 applications
* Cross checked and updated threshold values for existing alerts
---
 .../operate/monitoring-application-health.md | 40 +++++++++++++++----
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git a/content/developerportal/operate/monitoring-application-health.md b/content/developerportal/operate/monitoring-application-health.md
index a4ee5fd062a..0a7acfd863b 100644
--- a/content/developerportal/operate/monitoring-application-health.md
+++ b/content/developerportal/operate/monitoring-application-health.md
@@ -50,13 +50,37 @@ Warning Threshold | Not used. |
 Critical Threshold | If the runtime heartbeat of the application is more than 8 minutes, a critical alert will be generated.
 First actions to take | Check the application logs and [https://status.mendix.com/](https://status.mendix.com/). |

+Application Container CPU Percentage | |
+:---|:---|
+Description | Track the CPU utilization of the application container |
+Example message | Application container 34234543-6543-6543-6543-153d247b6543 - Instance Index: 0 has high CPU usage: 90.5
+Warning Threshold | CPU utilization is between 75% and 85%. |
+Critical Threshold | CPU utilization is higher than 85%.
+First actions to take | Inspect the trends for **Application node CPU usage** combined with all **Application Statistics** for anomalies and correlate those with application behavior. |
+
+Application Container Disk Percentage | |
+:---|:---|
+Description | Track the disk utilization for the database belonging to the application |
+Example message | Application container 34234543-6543-6543-6543-153d247b6543 - Instance Index: 0 has high disk usage: 90.5
+Warning Threshold | Disk utilization is between 75% and 85%. |
+Critical Threshold | Disk utilization is higher than 85%.
+First actions to take | Inspect the trends for **Application node CPU usage** combined with all **Application Statistics** for anomalies and correlate those with application behavior. |
+
+Application Container Memory Percentage | |
+:---|:---|
+Description | Track the memory utilization for the database belonging to the application |
+Example message | Application container 34234543-6543-6543-6543-153d247b6543 - Instance Index: 0 has high memory usage: 90.5
+Warning Threshold | Memory utilization is between 75% and 85%. |
+Critical Threshold | Memory utilization is higher than 85%.
+First actions to take | Inspect the trends for **Application node operating system memory** combined with all **Application Statistics** for anomalies and correlate those with application behavior. |
+
 CPU Credit Balance status | |
 :---|:---|
 Description | This only appears if you run into a warning or a critical Credit Balance. See section 3.2, [CPU Credits on AWS](#burstable) for a more detailed discussion
 Example message | Application container 34234543-6543-6543-6543-153d247b6543 - Instance Index: 0 has low cpu credit balance: 16.315528
-Warning Threshold | -
-Critical Threshold | -
-First actions to take | Inspect the trends for *Application node CPU usage* combined with *Application node disk throughput* and *Application node load* for anomalies and correlate those with application behavior.
+Warning Threshold | Credit balance goes below 80.0
+Critical Threshold | Credit balance goes below 20.0
+First actions to take | Inspect the trends for **Application node CPU usage** combined with **Application node disk throughput** and **Application node load** for anomalies and correlate those with application behavior.

 Critical Logs | |
 :---|:---|
@@ -68,10 +92,10 @@ First actions to take | Inspect the application log file and look up the CRITICA

 Database CPU Utilization | |
 :---|:---|
-Description | The overall system load of the application server. High values indicate high CPU load because of application activity, and/or the CPU is spending too much time waiting for disk reads or writes.
-Example message | Database has 95% CPU Utilization.
+Description | Track the CPU utilization for the database belonging to the application
+Example message | Database has 95% CPU utilization.
 Warning Threshold | Not used.
-Critical Threshold | System load is higher than 90%.
+Critical Threshold | CPU utilization is higher than 90%.
 First actions to take | Inspect the trends for **Application node CPU usage** combined with **Application node disk throughput** and **Application node load** for anomalies and correlate those with application behavior.

 Database Free Space | |
@@ -137,8 +161,8 @@ CPU | |
 :---|:---|
 Description | Overall system load of the application server. High values indicate high CPU load because of application activity, and/or the CPU is spending too much time waiting for disk reads or writes.
 Example message | System load: CRITICAL (9.6, 10.19, 10.84)
-Warning Threshold | System load is higher than 2.8 over the last 15, 10, and 5 minutes, with a maximum deviation of 20% between the values.
-Critical Threshold | System load is higher than 6.0 over the last 15, 10 and 5 minutes, with a maximum deviation of 20% between the values.
+Warning Threshold | System load is higher than 2.8 over the last 1, 5 and 15 minutes, with a maximum deviation of 20% between the values.
+Critical Threshold | System load is higher than 6.0 over the last 1, 5 and 15 minutes, with a maximum deviation of 20% between the values.
 First actions to take | Inspect the trends for *Application node CPU usage* combined with *Application node disk throughput* and *Application node load* for anomalies and correlate those with application behavior.

 Critical Logs | |

From 9ee54dc3aac7c4cb84da1122d21562366a6fdabf Mon Sep 17 00:00:00 2001
From: udh-mendix <47151110+udh-mendix@users.noreply.github.com>
Date: Wed, 13 Mar 2019 15:15:05 +0100
Subject: [PATCH 2/4] added missing alerts for JVM

---
 .../operate/monitoring-application-health.md | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/content/developerportal/operate/monitoring-application-health.md b/content/developerportal/operate/monitoring-application-health.md
index 0a7acfd863b..ff073824d11 100644
--- a/content/developerportal/operate/monitoring-application-health.md
+++ b/content/developerportal/operate/monitoring-application-health.md
@@ -122,6 +122,22 @@ Warning Threshold | Not used.
 Critical Threshold | If the application's virtual machine should be running but has completely disappeared, or if the JVM process does not respond to any signal anymore.
 First actions to take | Check the log files and application metrics for a possible cause of the crash.

+Virtual Machine Error | |
+:---|:---|
+Description | Show the state of the application's virtual machine, or an error state when the issue occurs with a virtual machine.
+Example message | Your application's virtual machine died because of non-recoverable error 1 times in the last minute.
+Warning Threshold | Not used.
+Critical Threshold | If the application's virtual machine should be running but has completely disappeared, or if the JVM process does not respond to any signal anymore.
+First actions to take | Check the log files and application metrics for a possible cause of the error.
+
+Virtual Machine Out Of Memory | |
+:---|:---|
+Description | Show the state of the application's virtual machine, or an error state when the issue occurs with a virtual machine.
+Example message | Your application's virtual machine ran out of memory and died 1 times in the last minute.
+Warning Threshold | Not used.
+Critical Threshold | If the JVM process has run out of memory and the application's virtual machine crashed.
+First actions to take | Check the log files and application metrics for a possible cause of the crash.
+
 ### 3.2 CPU Credits on AWS{#burstable}

 Apps running on Mendix Cloud V4 use AWS databases to store their data. These databases are classed as *burstable* and *non-burstable*. If a database is **burstable** this means that it has a specified performance baseline. This baseline is different for each sort of database and varies from 5% to 20%.

From 76f7548c2ae0188edb11689d564fc9bac8da40c4 Mon Sep 17 00:00:00 2001
From: udh-mendix <47151110+udh-mendix@users.noreply.github.com>
Date: Wed, 13 Mar 2019 15:21:57 +0100
Subject: [PATCH 3/4] fixed threshold for heartbeat alert

---
 .../operate/monitoring-application-health.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/developerportal/operate/monitoring-application-health.md b/content/developerportal/operate/monitoring-application-health.md
index ff073824d11..be9528edc16 100644
--- a/content/developerportal/operate/monitoring-application-health.md
+++ b/content/developerportal/operate/monitoring-application-health.md
@@ -44,11 +44,11 @@ There are a number of categories on the status overview that are related to vari

 Application Status | |
 :---|:---|
-Description | Our application runtime sends a heartbeat signal every few minutes to show it is still alive. If no heartbeat timeout has been received for some time, the system will generate an alert. The heartbeat value varies per cluster and we are reducing it over time, but in general is ~5 minutes.|
+Description | Our application runtime sends a heartbeat signal every few minutes to show it is still alive. If no heartbeat timeout has been received for some time, the system will generate an alert.|
 Example message | No runtime heartbeat received for 8m45s.
 Warning Threshold | Not used. |
-Critical Threshold | If the runtime heartbeat of the application is more than 8 minutes, a critical alert will be generated.
-First actions to take | Check the application logs and [https://status.mendix.com/](https://status.mendix.com/). |
+Critical Threshold | If the runtime heartbeat of the application has not been received for 5 minutes, a critical alert will be generated.
+First actions to take | Check the application logs for issues. Also check [https://status.mendix.com/](https://status.mendix.com/) for any planned maintenance. |

 Application Container CPU Percentage | |
 :---|:---|

From fb83ab6fa97b0de4e53719f4ad8d06f450ea8844 Mon Sep 17 00:00:00 2001
From: Mark van Ments <35492184+MarkvanMents@users.noreply.github.com>
Date: Mon, 18 Mar 2019 10:59:53 +0100
Subject: [PATCH 4/4] Minor Proofreading

---
 .../developerportal/operate/monitoring-application-health.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/developerportal/operate/monitoring-application-health.md b/content/developerportal/operate/monitoring-application-health.md
index be9528edc16..f90612a545e 100644
--- a/content/developerportal/operate/monitoring-application-health.md
+++ b/content/developerportal/operate/monitoring-application-health.md
@@ -116,7 +116,7 @@ First actions to take | Since the health check microflow is specific to your app

 Virtual Machine Crash | |
 :---|:---|
-Description | Show the state of the application's virtual machine, or an error state when the issue occurs with a virtual machine.
+Description | Show the state of the application's virtual machine, or an error state if there is an issue with a virtual machine.
 Example message | Your application's virtual machine died 1 times in the last minute.
 Warning Threshold | Not used.
 Critical Threshold | If the application's virtual machine should be running but has completely disappeared, or if the JVM process does not respond to any signal anymore.
@@ -124,7 +124,7 @@ First actions to take | Check the log files and application metrics for a possib

 Virtual Machine Error | |
 :---|:---|
-Description | Show the state of the application's virtual machine, or an error state when the issue occurs with a virtual machine.
+Description | Show the state of the application's virtual machine, or an error state if there is an issue with a virtual machine.
 Example message | Your application's virtual machine died because of non-recoverable error 1 times in the last minute.
 Warning Threshold | Not used.
 Critical Threshold | If the application's virtual machine should be running but has completely disappeared, or if the JVM process does not respond to any signal anymore.