How to Serve 200K Samples per Second with Single Prometheus

What is IT monitoring and why is it essential?

IT monitoring is the process of gathering metrics about the operations of an IT environment’s hardware and software, to ensure that everything is functioning as expected, to support applications and services, and to optimize the infrastructure. Basic monitoring is performed through hardware, software, and network operation checks, while more advanced monitoring gives detailed views on operational statuses, including average response times, number of application instances, error rates, cross-application traces, application availability, latency, etc.
Monitoring is an essential tool for a DevOps engineer. It puts them on the front line of IT operations.
Not having a monitoring system is like trying to find something in a dark room with your eyes closed. Without monitoring, it would be difficult, if not impossible, to detect anomalies and issues that need to be resolved swiftly.
Other tools also exist that can help with anomaly detection and problem-solving.

What tools do we use for monitoring?

There are many monitoring tools, some of which are open source. Each has its advantages and disadvantages.
This article focuses on open-source tools, specifically Prometheus and Thanos. I will explain how to build a monitoring system that retains data for long periods and handles up to 200K samples per second. The important point is that all of this runs on one centralized Prometheus and Thanos server.
To make the system easily manageable, you can use configuration management tools such as Puppet and Ansible to deploy and manage alerting rules, and to create and store backups of them.

What is Prometheus?

Config
Let’s talk about configurations.
Below are the Federation configs:

Img. 1: Federation config for AWS Kubernetes cluster

There is one important point to discuss.
During federation, do not collect all metrics and then drop the ones you do not need:
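One way to follow this advice is to ask the federated Prometheus only for the series you need, using the `match[]` parameters that the `/federate` endpoint accepts, rather than pulling everything and dropping series afterwards. The sketch below only builds such a request URL. The host name and the selectors are hypothetical and are not taken from our configuration.

```python
from urllib.parse import urlencode


def federate_url(base_url, selectors):
    """Build a /federate URL that requests only series matching the selectors."""
    # Each selector becomes its own repeated match[] query parameter.
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical selectors: pick the jobs and metric names you actually need.
selectors = ['{job="kubernetes-nodes"}', '{__name__=~"node_cpu.*"}']
url = federate_url("http://prometheus.example.internal:9090", selectors)
print(url)
```

In a real scrape config the same selectors would go under `params` as `match[]` values for the federation job, so the filtering happens on the source Prometheus before any data crosses the network.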

Img. 2: The simple architecture of Thanos.

Img. 3: Thanos Sidecar

Img. 4: Thanos Store

Img. 5: Thanos Compact

Img. 6: The final architecture of the monitoring infrastructure

As shown in Img. 6, we have a distributed monitoring infrastructure, and all servers have the same parameters. Below are the parameters of monitoring-backend-server1, on which the Main Prometheus server runs.
From our experience, we can suggest a server with the following parameters:

Finally, in the images below, we can see the actual usage of resources by Prometheus:

Img. 7: CPU usage of the Main Prometheus server

Img. 8: Memory usage of the Main Prometheus server

Img. 9: TSDB (disk) usage with 7 days retention
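As a rough cross-check on disk usage, the Prometheus storage documentation gives a simple capacity formula: retention time in seconds, times ingested samples per second, times bytes per sample, with an average of about 1 to 2 bytes per sample. The sketch below applies it to the numbers from this article (200K samples per second, 7 days retention). It is an estimate only, not the measured value shown in Img. 9.

```python
def estimate_tsdb_bytes(samples_per_second, retention_days, bytes_per_sample):
    """Capacity estimate: retention_seconds * samples_per_second * bytes_per_sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


# 1 and 2 bytes per sample bracket the average quoted in the Prometheus docs.
for bytes_per_sample in (1, 2):
    size = estimate_tsdb_bytes(200_000, 7, bytes_per_sample)
    print(f"{bytes_per_sample} byte(s) per sample: about {size / 1e9:.0f} GB")
```

Actual usage depends on series churn and compression, so treat the result as a planning range rather than a prediction.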

If Redis goes down, nothing breaks. We only need to start it again.
We have a second Prometheus server, configured identically to the Main Prometheus. The only difference is that it is passive and serves as a reserve. If an incident occurs with the Main Prometheus, for example the server goes down, the reserve server automatically replaces it by changing its state to active.
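The active/passive idea can be illustrated with a small selection function. This is a sketch of the concept only and does not describe the mechanism we use for the switch. The server names are hypothetical, and the health check is passed in as a plain function so the example runs without a network. A real probe could, for instance, call Prometheus's `/-/healthy` endpoint.

```python
from typing import Callable, Optional, Sequence


def pick_active(servers: Sequence[str],
                is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy server in priority order, or None if all are down."""
    for server in servers:
        if is_healthy(server):
            return server
    return None


# Hypothetical names; the dict stands in for real health probes.
servers = ["main-prometheus", "reserve-prometheus"]
health = {"main-prometheus": True, "reserve-prometheus": True}

print(pick_active(servers, health.get))  # main is healthy, so it stays active
health["main-prometheus"] = False        # simulate the main server going down
print(pick_active(servers, health.get))  # the reserve server takes over
```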
The next two images (Img. 10, Img. 11) show the performance of the system when querying without a caching component and with it. As we see in Img. 10, a query over the last 30 days of data executes in 4.4 seconds without the cache. Img. 11 shows the result of the same query when we use the caching of Prometheus-frontend. The query executes in 0.3 seconds, so we get more than 14x faster performance with caching.

Img. 10: First time querying, with Thanos Query datasource (no cache)

Img. 11: Second time querying, with Prometheus datasource (with Cortex cache)

Now we can query Prometheus-frontend, so every query is cached in Redis. Grafana goes through Prometheus-frontend, which then connects to Thanos Query. Thanos Query in turn connects to the Sidecar and Store to collect all of the data (from Prometheus and from Object Storage). As a result, we only need to set the Prometheus-frontend URL:Port in our Grafana datasource instead of the Prometheus URL:Port.
This lets us keep a large volume of data that can still be queried easily and quickly, regardless of how large the data or the infrastructure grows. We can access all of this data through a single Prometheus server, which handles approximately 200,000 samples per second with no downtime.
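The query path described above (Grafana to Prometheus-frontend to Thanos Query, with results cached) can be illustrated with a toy read-through cache. This is a simplified sketch, not how Prometheus-frontend is implemented: a plain dict stands in for Redis, a Python function stands in for Thanos Query, and the class and function names are made up for the example.

```python
class CachingFrontend:
    """Toy read-through cache in front of a slower query backend."""

    def __init__(self, backend):
        self._backend = backend  # callable standing in for Thanos Query
        self._cache = {}         # dict standing in for Redis

    def query(self, expr, start, end):
        key = (expr, start, end)
        if key not in self._cache:
            # Cache miss: ask the backend and remember the answer.
            self._cache[key] = self._backend(expr, start, end)
        return self._cache[key]


calls = []


def slow_backend(expr, start, end):
    calls.append(expr)  # record every request that reaches the backend
    return f"result of {expr} for [{start}, {end}]"


frontend = CachingFrontend(slow_backend)
first = frontend.query("up", 0, 3600)
second = frontend.query("up", 0, 3600)  # identical query, served from the cache
print(len(calls))
```

Only the first of the two identical queries reaches the backend, which is the effect behind the difference between Img. 10 and Img. 11.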

Img. 12: Samples per second appended in TSDB
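If you want to check the ingestion rate of your own server, Prometheus exposes the counter `prometheus_tsdb_head_samples_appended_total`, and its rate gives samples appended per second. The sketch below builds an instant-query URL for the HTTP API and parses a response of the documented shape. The host name is hypothetical and the response is hand-written for illustration, not a measurement from our system. I am also assuming this is a reasonable query for a graph like Img. 12, not stating the exact one behind it.

```python
from urllib.parse import urlencode

# Rate of samples appended to the TSDB head, averaged over 5 minutes.
INGESTION_RATE_EXPR = "rate(prometheus_tsdb_head_samples_appended_total[5m])"


def instant_query_url(base_url, expr):
    """Build a URL for the /api/v1/query instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


def first_value(response):
    """Return the first sample value of an instant-vector response, or None."""
    result = response["data"]["result"]
    return float(result[0]["value"][1]) if result else None


# Hand-written example in the shape the API returns (the values are made up).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"instance": "main"}, "value": [1700000000, "200123.4"]}],
    },
}

print(instant_query_url("http://prometheus.example.internal:9090", INGESTION_RATE_EXPR))
print(first_value(sample_response))
```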

To sum up, given the required server size, the volume of metrics, and the number of servers described above, one Main Prometheus server can handle large quantities of data.
We also have other solutions built on Loki, the ELK stack, Jaeger, and other tools. It is worth mentioning that in large-scale companies like PicsArt, monitoring plays a major role. It matters not only as an alerting mechanism for DevOps engineers, but also because such large volumes of data can be used by analytics servers and services.

After the full deployment, we noticed some previously “invisible” errors, and with good dashboards we were able to decrease the error rate in the PicsArt environment 3x, which directly affects app quality and user experience. Troubleshooting time for infrastructure-related cases decreased from 15 minutes to 2–3 minutes. With proactive monitoring, we predicted and fixed more than 200 cases during the last year, so they did not cause system downtime.
Lastly, note that any large-scale data needs to be smartly visualized.
