How do I instrument region and environment information correctly in Prometheus?

I have an application, and I'm running one instance of it per AWS region. I'm instrumenting the application code with a Prometheus metrics client and will expose the collected metrics at the /metrics endpoint. A central server will scrape the /metrics endpoints across all regions and store the data in a central time series database.

Let's say I've defined a metric named http_responses_total. I would like to know its value aggregated over all regions as well as the individual regional values. How do I store the region (any one of 13 regions) and the env (dev, test, or prod) along with the metrics so that I can slice and dice them by region and env?

I found a few ways to do it, but not sure how it's done in general, as it seems a pretty common scenario:

1. Storing region and env info as labels on each of the metrics (not recommended: https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels).
2. Using target labels. I have the region and env values available in the application and would like to set them from the application itself instead of in the scrape config.
3. Keeping a separate gauge metric that records region and env as labels (as described here: https://www.robustperception.io/exposing-the-software-version-to-prometheus). This is how I'm planning to store my application version info in the TSDB, but there is a difference: the version changes across releases, whereas the region, which I read from the config file, is constant. So I'm not sure this is a good way to do it.

I'm new to Prometheus. Could someone please suggest how I should store this region and env information? Are there any other better ways?

Answer:

All the proposed options will work, and all of them have downsides.

The first option (having env and region exposed by the application with every metric) is easy to implement but hard to maintain. Eventually somebody will forget to add these labels to a new metric, and series without them will silently drop out of any query that filters or groups by region or env, so a failure can go unobserved. Aside from that, you may not be able to add these labels to exporters written by someone else. Lastly, if you have to deal with millions of time series, more plain-text data means more scrape traffic.
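To make the traffic point concrete, here is a minimal sketch that renders one sample in the Prometheus text exposition format with and without static region and env labels. The label values, the sample value, and the helper name render_sample are made up for illustration, and the helper assumes label values that need no escaping.

```python
def render_sample(name, labels, value):
    """Render one sample line in the Prometheus text exposition format.

    Assumption: label values contain no backslashes, quotes, or newlines,
    so no escaping is performed here.
    """
    if labels:
        pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{pairs}}} {value}"
    return f"{name} {value}"


# Hypothetical label values, for illustration only.
base_labels = {"code": "200"}
static_labels = {"region": "us-east-1", "env": "prod"}

plain = render_sample("http_responses_total", base_labels, 1027)
labelled = render_sample(
    "http_responses_total", {**base_labels, **static_labels}, 1027
)

print(plain)
print(labelled)
print(len(labelled) - len(plain), "extra bytes per exposed sample")
```

The overhead per line is small, but it is repeated for every sample on every scrape, which is where it adds up for large numbers of series.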

The third option (storing these labels in a separate metric) will make it quite difficult to write and understand queries. Take this one for example:

sum by(instance) (node_arp_entries) and on(instance) node_exporter_build_info{version="0.17.0"}

It calculates the sum of node_arp_entries for instances running node-exporter version="0.17.0". More precisely, it calculates a sum for every instance and then drops the instances that have no matching build-info series with that version, but you get the idea.
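For readers unfamiliar with the matching step, the following sketch imitates what that query does on a tiny, invented data set. It is plain Python rather than anything Prometheus runs, and the instance names, devices, and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical instant vectors: lists of (labels, value) pairs.
node_arp_entries = [
    ({"instance": "a:9100", "device": "eth0"}, 3),
    ({"instance": "a:9100", "device": "eth1"}, 2),
    ({"instance": "b:9100", "device": "eth0"}, 7),
]
node_exporter_build_info = [
    ({"instance": "a:9100", "version": "0.17.0"}, 1),
    ({"instance": "b:9100", "version": "0.16.0"}, 1),
]


def sum_by_instance(vector):
    """Imitates `sum by(instance) (...)`: one total per instance."""
    totals = defaultdict(int)
    for labels, value in vector:
        totals[labels["instance"]] += value
    return dict(totals)


def and_on_instance(left, right_vector, **matchers):
    """Imitates `left and on(instance) right{...}`.

    Keeps a left-hand entry only if some right-hand series with the same
    instance satisfies all the label matchers.
    """
    allowed = {
        labels["instance"]
        for labels, _ in right_vector
        if all(labels.get(k) == v for k, v in matchers.items())
    }
    return {inst: val for inst, val in left.items() if inst in allowed}


result = and_on_instance(
    sum_by_instance(node_arp_entries),
    node_exporter_build_info,
    version="0.17.0",
)
print(result)
```

Every query that needs region or env would have to carry a join of this kind if the labels lived in a separate metric, which is the readability cost described above.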

The second option (having Prometheus add these labels as part of the scrape configuration) is what I would choose. To save words, consider this monitoring setup:

| Datacenter Prometheus | Regional Prometheus | Global Prometheus |
| --- | --- | --- |
| 1. Collects metrics from local instances. 2. Adds `dc` label to each metric. 3. Pushes the data into the regional Prometheus -> | 1. Collects data on datacenter scale. 2. Adds `region` label to all metrics. 3. Pushes the data into the global instance -> | Simply collects and stores the data on a global scale |

This is the kind of setup you need at Google scale, but the point is the simplicity. It's perfectly clear where each label comes from and why. This approach makes the Prometheus configuration somewhat more complicated, and the fewer Prometheus instances you have, the more scrape configurations you will need. Overall, I think this option beats the alternatives.
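With a single central Prometheus, one way to attach target labels without hand-writing a scrape job per region is file-based service discovery, where each target group carries its own labels. The sketch below is only an illustration under assumptions: the host names, port, and the helper file_sd_entry are hypothetical, and it presumes some deployment step already knows each instance's region and env.

```python
import json


def file_sd_entry(targets, region, env):
    """Build one target group in the JSON shape read by file-based
    service discovery: a list of targets plus labels applied to them."""
    return {"targets": list(targets), "labels": {"region": region, "env": env}}


# Hypothetical hosts and regions, for illustration only.
groups = [
    file_sd_entry(["app.us-east-1.example.com:8080"], "us-east-1", "prod"),
    file_sd_entry(["app.eu-west-1.example.com:8080"], "eu-west-1", "prod"),
]

# This document would be written to a file that a scrape job lists
# under file_sd_configs.
document = json.dumps(groups, indent=2)
print(document)
```

Once every scraped series carries these labels, a query such as sum by(region) (http_responses_total) gives the per-region values, and sum(http_responses_total) gives the global total, with no changes to the application code.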