I've an application, and I'm running one instance of this application per AWS region.
I'm trying to instrument the application code with the Prometheus metrics client, and will be exposing the collected metrics at the /metrics endpoint. There is a central server which will scrape the /metrics endpoints across all the regions and store them in a central time series database.
Let's say I've defined a metric named http_responses_total. I would like to know its value aggregated over all the regions, along with the individual regional values.
How do I store this region information (which could be any one of the 13 regions) and env information (which could be dev, test, or prod) along with the metrics, so that I can slice and dice the metrics based on region and env?
I found a few ways to do it, but not sure how it's done in general, as it seems a pretty common scenario:
- Storing region and env info as labels with each of the metrics (not recommended: https://prometheus.io/docs/instrumenting/writing_exporters/#target-labels-not-static-scraped-labels)
- Using target labels. I have the region and env values with me in the application and would like to set this information from the application itself instead of setting them in the scrape config.
- Keeping a separate gauge metric to record region and env info as labels (like described here: https://www.robustperception.io/exposing-the-software-version-to-prometheus). This is how I'm planning to store my application version info in the TSDB, but the difference between app version info and region info is that the version keeps changing across releases, whereas the region, which I get from the config file, is constant. So, I'm not sure if this is a good way to do it.
I'm new to Prometheus. Could someone please suggest how I should store this region and env information? Are there any other better ways?
CodePudding user response:
All the proposed options will work, and all of them have downsides.
The first option (having env and region exposed by the application with every metric) is easy to implement but hard to maintain. Eventually somebody will forget to add these labels, opening a possibility for an unobserved failure to occur. Aside from that, you may not be able to add these labels to other exporters written by someone else. Lastly, if you have to deal with millions of time series, more plain text data means more traffic.
The third option (storing these labels in a separate metric) will make it quite difficult to write and understand queries. Take this one for example:
sum by(instance) (node_arp_entries) and on(instance) node_exporter_build_info{version="0.17.0"}
It calculates a sum of node_arp_entries for instances with node-exporter version="0.17.0". Well, more specifically, it calculates a sum for every instance and then just drops those with the wrong version, but you get the idea.
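Applied to the metric from the question, the third option would need a join along the lines of the sketch below. It assumes a hypothetical app_info gauge with a constant value of 1 that carries region and env labels, with exactly one such series per instance; neither that metric name nor its label set comes from the question.

```promql
# Copy region and env from the (hypothetical) app_info series onto
# http_responses_total, then aggregate per region.
sum by (region) (
    http_responses_total
  * on (instance) group_left (region, env)
    app_info
)
```

Because app_info is assumed to always be 1, the multiplication leaves the counter values unchanged and only attaches the extra labels. Every query that needs region or env has to repeat this join, which is the readability cost described above.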
The second option (adding these labels with Prometheus as a part of the scrape configuration) is what I would choose. To save words, consider this monitoring setup:
Datacenter Prometheus | Regional Prometheus | Global Prometheus |
---|---|---|
1. Collects metrics from local instances. 2. Adds dc label to each metric. 3. Pushes the data into the regional Prometheus -> | 1. Collects data on datacenter scale. 2. Adds region label to all metrics. 3. Pushes the data into the global instance -> | Simply collects and stores the data on global scale |
This is the kind of setup you need at Google scale, but the point is the simplicity. It's perfectly clear where each label comes from and why. This approach requires you to make the Prometheus configuration somewhat more complicated, and the fewer Prometheus instances you have, the more scrape configurations you will need. Overall, I think this option beats the alternatives.
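For the single central server described in the question, a minimal sketch of this approach could attach region and env as target labels in the scrape configuration. The job name and hostnames below are made up for illustration, and only two of the 13 regions are shown.

```yaml
scrape_configs:
  - job_name: my-app                # hypothetical job name
    # metrics_path defaults to /metrics, so it is not set here
    static_configs:
      - targets: ["app.us-east-1.example.internal:8080"]   # hypothetical host
        labels:
          region: us-east-1
          env: prod
      - targets: ["app.eu-west-1.example.internal:8080"]   # hypothetical host
        labels:
          region: eu-west-1
          env: prod
```

With labels attached this way, both the global value and the per-region values come from plain aggregations, with no join needed:

```promql
# Total across all regions and environments
sum(http_responses_total)

# Per-region totals for prod only
sum by (region) (http_responses_total{env="prod"})
```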