In this article we will show you how a Prometheus query works internally, so that you can understand what affects the performance of your queries.
In our previous article we talked about some aspects of the Prometheus design and its internal workings. Now it’s time to look in more detail at how Prometheus fetches the samples needed to answer a query.
As we discussed in our previous article, Prometheus queries can get very slow and expensive if we use high-cardinality labels or a time range that spans multiple blocks. Its response can become too slow to be practical, especially in the case of dashboards (although there’s a solution to speed up dashboard load times). Why is this happening?
We’ll see why that happens soon, but let’s first try to understand how it works internally so that we can reach a conclusion!
Although we already covered how Prometheus stores data in our article “Understanding Prometheus”, let’s try to remember how its data gets structured with the following diagram:
As you can see, Prometheus groups data in blocks of 2 hours, keeping the current data in a WAL file. For each of these blocks we have an index, which tells us in which chunks we can find the time series we’re looking for.
Something that we should know is that Prometheus follows the TSDB format for structuring its files.
An important part of the TSDB format is its indexes. Let’s see how indexing works!
The format of a Prometheus index file is a bit complex, so we’ll omit some details and try to summarise it in a way that can be understood by everyone.
Basically, there are two types of indexes in each index file in Prometheus: posting index and series index.
Why is that? Well, first we need to know which series contain the labels we’re looking for, and we use the posting index for that. Once we have the series that should contain the data we’re looking for, we need to know which chunks those series are located in. We use the series index to achieve that.
The structure of an index file in TSDB format would look similar to this:
┌────────────────────────────┬─────────────────────┐
│ magic(0xBAAAD700) <4b>     │ version(1) <1 byte> │
├────────────────────────────┴─────────────────────┤
│ ┌──────────────────────────────────────────────┐ │
│ │                 Symbol Table                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                    Series                    │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Label Index 1                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     ...                      │ │
│ ├──────────────────────────────────────────────┤ │
│ │                Label Index N                 │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  Postings 1                  │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     ...                      │ │
│ ├──────────────────────────────────────────────┤ │
│ │                  Postings N                  │ │
│ ├──────────────────────────────────────────────┤ │
│ │              Label Offset Table              │ │
│ ├──────────────────────────────────────────────┤ │
│ │            Postings Offset Table             │ │
│ ├──────────────────────────────────────────────┤ │
│ │                     TOC                      │ │
│ └──────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
It’s also worth mentioning, in case you’ve missed it, that we’ll have one index file per block (up to two hours of data) recorded in our Prometheus storage.
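To make the first row of that layout more concrete, here is a minimal sketch in Python that reads the 4-byte magic number and the 1-byte version from the start of an index file. The big-endian byte order, the function name and the sample bytes are assumptions made for this example, not something taken from the Prometheus source code.

```python
import struct

MAGIC = 0xBAAAD700  # magic number shown in the layout above
# Assumption for this sketch: big-endian, 4-byte magic followed by 1-byte version.
HEADER = struct.Struct(">IB")


def parse_index_header(data: bytes) -> int:
    """Return the format version if the data starts with the expected magic."""
    if len(data) < HEADER.size:
        raise ValueError("index header is too short")
    magic, version = HEADER.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError(f"unexpected magic number: {magic:#x}")
    return version


# Hypothetical header bytes built by hand for the example.
sample_header = bytes([0xBA, 0xAA, 0xD7, 0x00, 0x01])
header_version = parse_index_header(sample_header)
print(header_version)
```

Checking the magic number first is a cheap way to reject a file that is not an index at all before trying to interpret any of its sections.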
The symbols table contains a sorted list of all the distinct strings (label names and label values) that appear in the stored series. These strings can be referenced from any of the subsequent sections.
The purpose of the symbols table is to save space in the mappings between each series and its labels. For example, if one thousand series use the same label name and value, they can all reference the same entries in the symbols table instead of repeating the same strings in each series entry. This is clever, right?
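The space saving can be illustrated with a small Python sketch. It assumes that the table stores each distinct string (label name or label value) once, in sorted order, and that each series then refers to those strings by position. The function names and the example labels are hypothetical.

```python
def build_symbol_table(series_list):
    """Store every distinct label name and value once (sorted) and encode
    each series as pairs of positions in that table."""
    symbols = sorted({text
                      for labels in series_list
                      for pair in labels.items()
                      for text in pair})
    position = {text: i for i, text in enumerate(symbols)}
    encoded = [
        [(position[name], position[value]) for name, value in sorted(labels.items())]
        for labels in series_list
    ]
    return symbols, encoded


def decode_series(symbols, encoded_series):
    """Rebuild the label set of one series from its symbol references."""
    return {symbols[name]: symbols[value] for name, value in encoded_series}


# Two made-up series that share most of their labels.
example_series = [
    {"__name__": "http_requests_total", "job": "api", "method": "GET"},
    {"__name__": "http_requests_total", "job": "api", "method": "POST"},
]
symbols, encoded = build_symbol_table(example_series)
print(symbols)
print(encoded)
```

The two series carry six label names and values each, yet only seven distinct strings end up in the table; everything else becomes a small integer reference.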
Let’s look at the next section in the file now, the series section!
The series section contains the list of series. Each entry holds the labels of that particular series and the list of chunks where its samples are stored. Before we continue, let’s remember what a time series is: a time series is a stream of values identified by a metric name and a set of labels, where each value comes with the timestamp at which it was measured.
This means that each different series could represent either the same metric name with different label values, or a totally different metric name, with their corresponding combination of label-value pairs.
As we mentioned in the previous section, label names and values are stored in the symbols table. This means that inside the series section each label is defined as a reference to the symbols table, which saves disk space in each index file.
For a given time series, each label name appears only once, as a series cannot have two different values for the same label. If we have data for the same metric name and a different label value, it will be grouped as part of a different time series. We hope this helps to understand how they’re grouped together.
What about the chunks? The chunks are defined in the last part of this section.
The chunk information is basically a set of metadata: a first field with the number of chunks containing data for the current series and then, for each chunk, the minimum and maximum timestamps of the samples it holds for that series.
Why is this useful? Remember that a chunk file can grow up to 512MB and holds chunks that belong to many different time series. Knowing the minimum and maximum timestamp of each chunk for the current series means that we can skip the (potentially large) part of the data that falls outside the time range we’re interested in.
On top of these fields, we finally have a reference to where the data is located in the chunk file.
This is a simplified explanation of the chunks metadata, because most of the fields are stored as deltas from the previous chunk, but let’s not get into too much detail to keep things simple and easy to understand.
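The filtering described above can be sketched in a few lines of Python. The `ChunkMeta` class, its field names and the sample values are hypothetical; the sketch also ignores the delta encoding and only shows the overlap check between a chunk and the queried time range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkMeta:
    min_time: int  # smallest timestamp stored in the chunk for this series
    max_time: int  # largest timestamp stored in the chunk for this series
    ref: int       # where the chunk data lives in the chunk file


def chunks_in_range(chunks, start, end):
    """Keep only the chunks whose [min_time, max_time] overlaps [start, end]."""
    return [c for c in chunks if c.max_time >= start and c.min_time <= end]


# Made-up metadata for one series with three chunks.
series_chunks = [
    ChunkMeta(min_time=0, max_time=999, ref=8),
    ChunkMeta(min_time=1000, max_time=1999, ref=120),
    ChunkMeta(min_time=2000, max_time=2999, ref=232),
]
selected_refs = [c.ref for c in chunks_in_range(series_chunks, 1500, 2500)]
print(selected_refs)
```

Only the references of the overlapping chunks need to be followed into the chunk file, so the first chunk is never read for this query.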
You can find below a diagram about how the series section gets structured:
We’ll be skipping the label index and label offset table sections, as they’re no longer in use. Let’s look at the postings section next!
To understand how postings work, we have to look at two different sections that work together: the postings section and the postings offset table.
We should see postings as lists of series grouped by label name and value. They play a key part during the execution of a Prometheus query, as they tell us which series are associated with a given label name and value pair.
The postings offset table stores a sequence of label name and value pairs and associates each of them with an offset. This offset points to the position in the postings section where the list of series for that pair is stored. The table is sorted lexicographically by label name and value.
The postings section is basically a list of series references. This means that from a set of label name and value pairs, we can obtain the references to the series we are interested in.
As we saw earlier, once we know the series we can use the series section in the index to find out which chunks we should access to obtain the samples corresponding to our query. That’s basically what a Prometheus query does!
In the diagram shown above you can see how every label pair gets mapped to a reference in the postings section, where we can find the reference to the corresponding series.
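As a rough illustration of how these two sections cooperate, the Python sketch below models the postings offset table as a dictionary and the postings section as a list of series reference lists. All label pairs, offsets and series references are made up, and using a list position as the "offset" is a simplification for the example.

```python
# Toy postings offset table: (label name, label value) -> offset of its postings list.
postings_offset_table = {
    ("__name__", "http_requests_total"): 0,
    ("job", "api"): 1,
    ("method", "GET"): 2,
}

# Toy postings section: each entry is a sorted list of series references.
postings_section = [
    [1, 2, 3, 4],  # series with __name__="http_requests_total"
    [1, 2, 5],     # series with job="api"
    [2, 4, 5],     # series with method="GET"
]


def select_series(label_pairs):
    """Return the series references that carry every requested label pair."""
    result = None
    for pair in label_pairs:
        offset = postings_offset_table.get(pair)
        if offset is None:
            return []  # unknown pair: no series can match
        refs = set(postings_section[offset])
        result = refs if result is None else result & refs
    return sorted(result or [])


matching = select_series([
    ("__name__", "http_requests_total"),
    ("job", "api"),
    ("method", "GET"),
])
print(matching)
```

Each label pair in the query contributes one postings list, and the series we want are the ones present in all of them, which is why the lists are intersected.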
Before we wrap up everything we’ve learned today, let’s look at one last section: the table of contents.
Table of Contents (TOC)
The table of contents is the entry point to the index. It works like a book’s table of contents, where Prometheus can find where every section of the index starts and finishes.
Its format is as follows:
┌─────────────────────────────────────────┐
│ ref(symbols) <8b>                       │
├─────────────────────────────────────────┤
│ ref(series) <8b>                        │
├─────────────────────────────────────────┤
│ ref(label indices start) <8b>           │
├─────────────────────────────────────────┤
│ ref(label offset table) <8b>            │
├─────────────────────────────────────────┤
│ ref(postings start) <8b>                │
├─────────────────────────────────────────┤
│ ref(postings offset table) <8b>         │
├─────────────────────────────────────────┤
│ CRC32 <4b>                              │
└─────────────────────────────────────────┘
As you can see, it contains pointers (or references) to each section of the index. This is very important, as it allows the algorithm to gather the right data from the index!
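Because the TOC has a fixed size (six 8-byte references plus a 4-byte checksum, 52 bytes in total), reading it can be sketched with Python's `struct` module. This sketch assumes that the TOC sits at the very end of the index file and that its fields are big-endian; the field names and the fake file contents are hypothetical, and the checksum is returned without being verified.

```python
import struct

# Assumption for this sketch: six big-endian 8-byte references and a 4-byte CRC32.
TOC = struct.Struct(">6QI")
TOC_FIELDS = (
    "symbols",
    "series",
    "label_indices_start",
    "label_offset_table",
    "postings_start",
    "postings_offset_table",
)


def parse_toc(index_bytes: bytes):
    """Read the section references from the last TOC.size bytes of the index."""
    if len(index_bytes) < TOC.size:
        raise ValueError("index is too short to contain a TOC")
    *refs, checksum = TOC.unpack_from(index_bytes, len(index_bytes) - TOC.size)
    return dict(zip(TOC_FIELDS, refs)), checksum


# Fake index made of some filler bytes followed by a hand-built TOC.
fake_index = b"\x00" * 10 + TOC.pack(5, 100, 0, 0, 200, 300, 0xDEADBEEF)
toc, toc_checksum = parse_toc(fake_index)
print(toc["postings_offset_table"])
```

With these references in hand, a reader can jump straight to the postings offset table or the series section instead of scanning the whole file.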
Putting everything together, we now know the basics to understand how a Prometheus query works.
We’ve seen how Prometheus uses the TOC to be able to read the index properly. Once in the index, it uses the postings section to get the series associated with each label-value pair in our query.
Having the series, it’s now time for the series section to take action! The series section allows us to find the right chunks to be read for each of these time series.
You might be wondering now how the metric name is used in this whole process. Well, in Prometheus a metric name is basically a label which under the covers is defined with the name “__name__”.
This means that the metric name is just another label name and value pair in the set of labels we look up in the postings section of our index!
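A tiny Python sketch makes this equivalence explicit: a selector such as `http_requests_total{job="api"}` boils down to a set of label pairs in which the metric name is stored under the `__name__` label. The helper function and the example metric are hypothetical and only illustrate the idea.

```python
def selector_to_label_pairs(metric_name, labels):
    """Turn a metric name plus labels into the label pairs used for the lookup."""
    pairs = {"__name__": metric_name}  # the metric name is just another label
    pairs.update(labels)
    return sorted(pairs.items())


lookup_pairs = selector_to_label_pairs("http_requests_total", {"job": "api"})
print(lookup_pairs)
```

Every pair in the resulting list is looked up in the postings offset table in exactly the same way, whether it came from the metric name or from a regular label.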
Having said that, let’s try to make the whole process clearer with this diagram:
I hope it’s now clear how indexes come into play during the processing of a Prometheus query.
After everything we’ve learned around Prometheus internals, it’s not difficult to understand why Prometheus queries are sometimes very slow. The time it takes to process the different time series will depend highly on the cardinality of the labels and on the selected time range. Why?
First, the selected time range will determine the number of blocks we’ll have to access. Keep in mind that for each block we’ll have to process one index file and gather the targeted chunks after finding them using that index. Once we have all of them, we’ll have to merge them to present the whole data to the client.
Second, high cardinality in any of the labels will mean that many more “time series” will be created for each block. Why does this matter so much?
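To get a feeling for how the time range translates into blocks, here is a rough Python estimate. It assumes fixed 2-hour blocks aligned to multiples of the block length and ignores any merging of blocks into larger ones, so the real number on a given server may differ. The function name and timestamps are made up.

```python
BLOCK_MS = 2 * 60 * 60 * 1000  # two hours in milliseconds


def blocks_touched(start_ms: int, end_ms: int, block_ms: int = BLOCK_MS) -> int:
    """Estimate how many fixed-size blocks overlap the range [start_ms, end_ms]."""
    if end_ms < start_ms:
        raise ValueError("end must not be before start")
    first_block = start_ms // block_ms
    last_block = end_ms // block_ms
    return last_block - first_block + 1


# A 24-hour query that starts 30 minutes into a block.
query_start = 30 * 60 * 1000
query_end = query_start + 24 * 60 * 60 * 1000
day_query_blocks = blocks_touched(query_start, query_end)
print(day_query_blocks)
```

Under these assumptions a one-hour query inside a single block opens one index file, while the 24-hour query above has to open thirteen of them.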
Let’s say that we have a label “a” containing 50k values, a second label “b” containing 10k values and a third label “c” containing 1k values. This means that, in the worst case where every combination of values occurs, the number of time series, with their corresponding mappings in the index files, will go up to 50k * 10k * 1k = 500 billion time series.
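The arithmetic is easy to check with a short Python snippet. It computes the worst-case number of series as the product of the label cardinalities, assuming that every combination of values actually occurs; the function name is hypothetical.

```python
from math import prod


def max_series(cardinalities):
    """Worst-case number of series: every combination of label values exists."""
    return prod(cardinalities)


worst_case = max_series([50_000, 10_000, 1_000])
print(f"{worst_case:,}")
```

In practice not every combination shows up, but the product is the upper bound, and it grows multiplicatively with every high-cardinality label we add.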
This would be a huge problem for the scalability of our single Prometheus instance!
If you are interested in learning more about the Prometheus monitoring system, we highly recommend this book: “Prometheus: Up & Running”.
In this article we’ve learned how Prometheus finds data based on the labels and time range specified in a PromQL query. To gather all the necessary data, it uses two kinds of indexes: posting and series indexes. They are responsible for the label and series mappings respectively.
We’ve tried to simplify the process slightly to make it more understandable for everyone. This is a complex structure that takes time to digest, so don’t worry if you don’t understand it well the first time you read it, you’ll get there after reviewing it a couple of times!
That’s all from us today. We hope you’ve learned something new and enjoyed our journey together through the internals of Prometheus and TSDB.
Please follow us to be notified when new content like this gets published! Thanks for reading us!