Magical Graphite

Or, « What I’ve learned about Graphite configuration ».

Last week, I worked on configuring Graphite and had to understand how it stores and aggregates data. So here are a few facts.

Graphite Retention

The way our data will be stored is described in /opt/graphite/conf/storage-schemas.conf. As an example:

[default]
 pattern = .*
 retentions = 1s:30m,1m:1d,5m:2y

This worked great when I was looking at data from the last 30 minutes.
If I was trying to display last hour metrics: nothing.
Drawing null as zero was giving me a horizontal line at the bottom of the graph.

The magic of aggregation

This behaviour comes from the file /opt/graphite/conf/storage-aggregation.conf where we find the following lines:

[99_default_avg]
 pattern = .*
 xFilesFactor = 0.5
 aggregationMethod = average

Our problem comes from xFilesFactor. It means that by default, we need at least 50% of the data to be non-null to store an average value. Think about it.

So here, I’m having a metric every second during 30 minutes. If Graphite doesn’t have something for a given second, the value is set to null. Fine, let’s move forward.
For interval higher than 30 minutes (and lower than a day), Graphite will gather data based on the aggregation configured. So it will average data and set the value null if it has less than 50% usable values (not null).

In our case, Graphite tries to average one minute of data (1m:1d) with the precision of 1s from the first retention rule (1s:30m). To understand why nothing is displayed, consider I’m Collectd is sending data to Graphite. On average, metrics are arriving every 3s. On a one minute interval, we gather 20 values but Graphite is considering 60 values, 40 being null. We only have 33% (0.33) metrics usable which is lower than 50% Graphite is waiting for so the averaged value is set to null.

The art of confusion

Now that we updated our configuration, set xFilesFactor to 0 to be sure, restart carbon-cache, everything should work fine…

But that’s not the case; no change.

In fact, previous configuration is still being used in wsp storage files. We can check it with whisper-info.py.

whisper-info.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp
 
 maxRetention: 63072000
 xFilesFactor: 0.5
 aggregationMethod: average
 fileSize: 2561812

Archive 0
 retention: 1800
 secondsPerPoint: 1
 points: 1800
 size: 21600
 offset: 52
Archive 1
 retention: 86400
 secondsPerPoint: 60
 points: 1440
 size: 17280
 offset: 21652
Archive 2
 retention: 63072000
 secondsPerPoint: 300
 points: 210240
 size: 2522880
 offset: 38932

See, we still have xFilesFactor: 0.5.
If you don’t care about previous data, a good solution is to delete files so that the new parameters will be used (rm -rf /opt/graphite/storage/whisper/collectd/). Maybe it’s a little bit overkill, (but easy and fast).

The other solution consists in using whisper-resize.py to enforce the new configuration.
whisper-resize.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp 3s:30m,1m:1d,5m:2y –xFilesFactor=0.1

The above works fine, but this is the other way to configure how many metrics Graphite can keep. It has the format n:i, which means we store a measure every n seconds and we want i points to be stored (computed with interval / n).

Example: 3s:30m
30m = 1800s
1800 / 3 = 600

3:600

So 3s:30m,1m:1d,5m:2y gives us 3:600 60:1440 300:210380.

« An average Gregorian year is 365.2425 days = 52.1775 weeks = 8765.82 hours = 525949.2 minutes = 31556952 seconds (mean solar, not SI). » Wikipedia

Note

Thing to remember concerning storage-schemas.conf (taken from Graphite doc):

« Changing this file will not affect already-created .wsp files. Use whisper-resize.py to change those. »