Grafana & Prometheus: receive alerts on Telegram

Continuing with this set of posts about Grafana & Prometheus, we are now going to see how to make a Telegram bot that alerts us if something is off with our monitored services. If you missed Part 1, that’s where you’ll find how to install both services using Docker, together with node_exporter and cAdvisor. We also searched for ready-made dashboards on Grafana Dashboards and set them up for node_exporter and cAdvisor.

Now, in this Part 2, we are going to create alerts for our server. Why do we need alerts? We can’t spend all day checking Grafana dashboards, so having alerts is going to make our work a lot easier. CPU load too high? Alert me. Out of RAM? Alert me. Network at 100%? Alert me. These and many more are classic types of alerts. Prometheus gives us its own way to create alerts, but in this case we are going to use Grafana’s.

Creating a bot on Telegram and configuring our first alert

We start by going to Grafana and opening “Alerting”:

Then we choose “Notification channels” and click “Add channel”:

For the name, put whatever you want; I’m going to call it “Grafana”. In “Type” we can choose from a wide variety of communication channels, like email, Teams, LINE, Slack, etc. We are going to use Telegram just because it’s my main communication channel (and this is my blog, lol). I think we are going to need a Part 3 for this long tutorial, where I can set up email or Slack alerts too.

After setting the name and choosing Telegram, we get 2 new fields we have to complete: BOT API Token and Chat ID. For this, we have to open Telegram (web or app) and chat with BotFather. Go to the “Search” bar and type “botfather”; this is the bot that creates bots for Telegram. In the next image, you can see the full conversation I had with it. I’m going to explain it below:

I’m going to cut the image into pieces because it looks awful all together and mobile users are going to hate me

I’m using the Telegram web app on Linux. We need to start this chat with “/start”, or just click in the app where you can see all the chat options. This is going to be very simple, because at every step BotFather tells us how to proceed. Then pick a name and a username: the name can be anything, but the username has to be unique. After that, BotFather gives us the Bot Token we need, and we paste it into the new notification channel in Grafana:

Going back to Telegram, create a new group, name it and then add this new bot:

Now we only need the Chat ID. This is a little tricky: first we need to open the following link in our browser, replacing <botToken> with our own token

https://api.telegram.org/bot<botToken>/getUpdates
e.g.
https://api.telegram.org/bot12345678912:ashduywjslkiAHNFYWN_jhgsxHSUJD1212/getUpdates

You’re going to see something like this. Where the arrow is pointing is the chat ID we need to paste into Grafana
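If you would rather not hunt for the ID by eye, the same lookup can be sketched in Python. This is only a sketch under a few assumptions: the function names are made up for this post, the sample response is a trimmed, invented example in the shape getUpdates returns, and fetch_chats needs network access and a real token. If the result list comes back empty, sending any message in the group and trying again usually helps.

```python
import json
import urllib.request

API_BASE = "https://api.telegram.org"


def get_updates_url(bot_token: str) -> str:
    """Build the getUpdates URL for a given bot token."""
    return f"{API_BASE}/bot{bot_token}/getUpdates"


def extract_chats(updates: dict) -> dict:
    """Map chat ID -> chat title (or chat type) for every message update."""
    chats = {}
    for update in updates.get("result", []):
        chat = update.get("message", {}).get("chat")
        if chat:
            chats[chat["id"]] = chat.get("title", chat.get("type", ""))
    return chats


def fetch_chats(bot_token: str) -> dict:
    """Call getUpdates over HTTPS and return the chats found (needs network)."""
    with urllib.request.urlopen(get_updates_url(bot_token), timeout=10) as resp:
        return extract_chats(json.load(resp))


# Invented sample, trimmed to the fields used above.
sample = {
    "ok": True,
    "result": [
        {
            "update_id": 1,
            "message": {
                "message_id": 5,
                "chat": {"id": -1001234, "title": "Alerts", "type": "group"},
                "text": "hello",
            },
        }
    ],
}

print(extract_chats(sample))
```

With a real token, fetch_chats("<botToken>") would return the IDs of the chats the bot has recently seen messages in.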

Going back to Grafana, we need to look at the notification settings:

The first option is “Use this notification for all alerts”, which makes this channel the default for every alert. I don’t think this is the best solution; we keep better control of our alerts if we divide them across different channels.

The next option is interesting: we can take a screenshot of the panel and send it via Telegram. For this we first need an extra container that renders the image, so this time we are going to keep it disabled.
“Disable Resolve Message”: if we tick this option, we stop receiving the message that is sent when the problem is resolved. For example, if the CPU is at 100% and then goes back to normal, we would normally receive a new message telling us that the CPU is back to normal. Ticking this option disables that message.
“Send reminder”: if we tick this, a new field appears where we set how often reminders are sent.

Now we are ready to test it! If everything is okay, we can click on “Test” and we will receive a new message in our Telegram group. It is going to look like this:

Perfect! We can now see a test notification. The URL shows “localhost” because I’m not using a public IP. If you are doing this on a cloud or public server, you will see your domain or IP instead.
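If the test message never arrives, it can help to rule out Grafana by sending a message with the same token and chat ID directly through the Bot API’s sendMessage method. The sketch below is my own illustration, not something Grafana provides: the function names are hypothetical, and send_message needs network access plus your real token and chat ID.

```python
import json
import urllib.parse
import urllib.request


def build_send_message(bot_token: str, chat_id: int, text: str) -> urllib.request.Request:
    """Prepare a POST request for the Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=data, method="POST")


def send_message(bot_token: str, chat_id: int, text: str) -> bool:
    """Send the message and report whether Telegram answered with ok=true."""
    request = build_send_message(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as resp:
        return bool(json.load(resp).get("ok"))


# Building the request does not touch the network, so it is safe to inspect.
req = build_send_message("123:abc", -1001234, "Test from Python")
print(req.full_url)
```

If send_message works but Grafana’s “Test” button does not, the problem is more likely in the channel settings than in the bot itself.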

Creating a new panel on Grafana to receive alerts

Now we are going to create a new panel in Grafana. Why? Because Grafana alerts need queries that don’t use template variables. If we try this with a panel from a downloaded dashboard, and its query has variables, we are going to see a message like this:

So we are going to create a new panel by clicking on “Add Panel”:

Now create a simple panel with a PromQL query. What’s PromQL? It’s the Prometheus Query Language, which we use to fetch data in real time. So: 1 choose a name for your new panel, 2 write a short description, 3 write a new query, in this case one that takes the average CPU usage:

100 - (avg by (hostname)(irate(node_cpu_seconds_total{mode="idle"}[2m]))*100)

Then 4 use {{hostname}} as the default legend name, or name it whatever you like. Then click “Apply” and “Save”.

Now if we go to the “Alert” tab, we can finally create a new alert.
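To make the query less magical, here is the arithmetic it performs, sketched in Python with invented numbers. The idle counter holds the seconds each CPU core has spent idle, irate takes the per-second rate from the last two samples in the window, avg averages the cores, and the result is subtracted from 100. The function names and sample values are assumptions for illustration, and the sketch ignores counter resets, which Prometheus itself handles.

```python
def irate(samples):
    """Per-second rate from the last two (timestamp, value) samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_by_cpu):
    """100 - avg(idle rate) * 100, mirroring the PromQL expression."""
    rates = [irate(samples) for samples in idle_by_cpu.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Invented idle-second counters for two cores, scraped 15 seconds apart.
idle_by_cpu = {
    "cpu0": [(0, 1000.0), (15, 1012.0)],  # idle 12s out of 15s
    "cpu1": [(0, 2000.0), (15, 2003.0)],  # idle 3s out of 15s
}

print(cpu_usage_percent(idle_by_cpu))
```

In this example the cores were idle 80% and 20% of the time, so the average idle share is 50% and the reported usage is about 50%.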

Create Alert

Creating an alert is very intuitive and easy, but we are going step by step because this can also fail. In this case I want to be alerted if the average CPU usage goes above 90% for 15m.

So in step 1 we name the alert and set how often the rule is evaluated and for how long the condition has to hold, in this case every 1m for 5m. Then we can set all the conditions we need. In this case it’s going to be simple, but I encourage you to check all the options. I chose avg(), but you can use min(), max(), diff(), and many more.

After that we pick our query, “CPU Server” (this is the name I gave my metric query, the default is “A”), then the time window to look back over, and finally the end of that window, which defaults to “now”. The final step of the condition in this case is “IS ABOVE” 90 (%). We can use other options like “IS BELOW”, “IS OUTSIDE RANGE” and more.
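To picture what the rule does with those settings, here is a deliberately simplified model in Python. It is not Grafana’s implementation: the function names are made up, the state names are lower-cased for convenience, and the no-data and error cases are left out. It only shows the idea that the reducer (avg) is compared with the threshold, and that the alert waits in a pending state until the condition has held for the configured time.

```python
def condition_fires(values, threshold=90.0):
    """avg() of the query values IS ABOVE threshold (empty input -> False here)."""
    if not values:
        return False
    return sum(values) / len(values) > threshold


def next_state(firing, firing_for_seconds, hold_for_seconds):
    """Return 'ok', 'pending' or 'alerting' for one evaluation."""
    if not firing:
        return "ok"
    if firing_for_seconds >= hold_for_seconds:
        return "alerting"
    return "pending"


# Evaluate every 60s; the condition must hold for 300s before alerting.
firing_for = 0
for values in ([40, 50], [95, 97], [96, 98], [99, 97], [95, 96], [97, 99], [98, 99]):
    firing = condition_fires(values)
    state = next_state(firing, firing_for, hold_for_seconds=300)
    print(values, state)
    firing_for = firing_for + 60 if firing else 0
```

In this run the first evaluation is ok, the next five are pending, and only the last one turns into alerting, which is when the Telegram message would go out.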

Then we get to the error handling: what happens if we don’t get any values, and what happens if the execution fails or times out. Here we also have some options: to be alerted, to keep the last state, or simply to keep it Ok.

Finally, we select the Telegram channel we created before as our notification channel. There we can add a message, and we can also add tags.

If we click on “Test rule” we can open the rule we just created and check carefully that everything is as we need it. This reminds me of a YAML file, but I don’t know if that is how it looks on the server.

Now click on “Save” and let’s test this alert.

Stressing CPU to test alerts

If we now go to Alerting, we are going to see something like this: “UNKNOWN for seconds”. This means the alert hasn’t yet run for the evaluation time we set, so we need to wait:

After the time we set, we can see a green heart, which means that the alert is up and running. If we see a broken pink heart, that means that something is wrong, and you need to check the alert with “Test rule”:

Now we just need to stress our CPU to check that everything is okay. For that I recommend installing the “stress” command:

$ sudo apt install stress

Then we stress the CPU using:

$ sudo stress -c 2 -i 1 -m 1 --vm-bytes 128M -t 300s

-c 2: spawn 2 workers spinning on sqrt()

-i 1: spawn 1 worker spinning on sync()

-m 1: spawn 1 worker spinning on malloc()/free()

--vm-bytes 128M: malloc 128MB per vm worker

-t 300s: Timeout after 300 seconds.
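If the stress package isn’t available on your system, a rough stand-in for the “-c 2” part can be sketched in Python. Treat it as an assumption-heavy sketch: burn and stress_cpu are names I made up, and it only loads the CPU, so there is nothing equivalent to the -i or -m workers.

```python
import math
import multiprocessing
import time


def burn(seconds: float) -> int:
    """Spin on sqrt() until the deadline; return how many iterations ran."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        math.sqrt(iterations)
        iterations += 1
    return iterations


def stress_cpu(workers: int, seconds: float) -> None:
    """Start one busy process per worker and wait for all of them to finish."""
    processes = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(workers)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
```

To imitate the command above, call stress_cpu(2, 300) from a script, inside an if __name__ == "__main__": block so the worker processes start cleanly on every platform.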

And voilà! We have our CPU alerts on Telegram!

This was a long trip. I had to study a lot in my free time to come up with this post, and I hope it’s useful for someone. I’m going to write more G&P posts, but now I’m going to rest and write about other services.
