Magical Graphite

Or, « What I’ve learned about Graphite configuration ».

Last week, I worked on configuring Graphite and had to understand how it stores and aggregates data. So here are a few facts.

Graphite Retention

The way our data will be stored is described in /opt/graphite/conf/storage-schemas.conf. As an example:

[default]
 pattern = .*
 retentions = 1s:30m,1m:1d,5m:2y

This worked great when I was looking at data from the last 30 minutes.
When I tried to display the last hour of metrics: nothing.
Drawing null as zero only gave me a horizontal line at the bottom of the graph.

The magic of aggregation

This behaviour comes from the file /opt/graphite/conf/storage-aggregation.conf where we find the following lines:

[99_default_avg]
 pattern = .*
 xFilesFactor = 0.5
 aggregationMethod = average

Our problem comes from xFilesFactor. It means that by default, at least 50% of the points in an aggregation interval must be non-null for an average value to be stored. Think about it.

So here, I have one slot per second for 30 minutes. If Graphite receives nothing for a given second, the value is set to null. Fine, let’s move forward.
For time ranges longer than 30 minutes (and shorter than a day), Graphite reads the next archive, whose points are built with the configured aggregation method. So it averages the data and sets the value to null if fewer than 50% of the values are usable (not null).

In our case, Graphite tries to average one minute of data (1m:1d) from points at the 1s precision of the first retention rule (1s:30m). To understand why nothing is displayed, consider that Collectd is sending data to Graphite and that, on average, metrics arrive every 3s. Over a one-minute interval, we gather 20 values but Graphite considers 60 slots, 40 of them null. Only 33% (0.33) of the values are usable, which is lower than the 50% Graphite requires, so the averaged value is set to null.
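The rule described above can be reproduced in a few lines of Python. This is only a sketch of the idea, not Whisper's actual code: the function name aggregate and the sample value 42.0 are made up for illustration.

```python
def aggregate(points, x_files_factor=0.5):
    """Average the non-null points, or return None if too few are known."""
    known = [p for p in points if p is not None]
    if not known:
        return None  # nothing to average, whatever the factor is
    if len(known) / len(points) < x_files_factor:
        return None  # not enough usable values in this interval
    return sum(known) / len(known)

# One minute of 1s slots, with a value arriving only every 3 seconds.
minute = [42.0 if second % 3 == 0 else None for second in range(60)]

print(aggregate(minute, 0.5))  # None: 20/60 is below 0.5
print(aggregate(minute, 0.3))  # 42.0: 20/60 is above 0.3
```

With a factor of 0.5 the minute is dropped, which is exactly the empty graph seen beyond the first 30 minutes.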

The art of confusion

Now that we have updated our configuration, set xFilesFactor to 0 to be sure, and restarted carbon-cache, everything should work fine…

But that’s not the case; no change.

In fact, the previous configuration is still being used in the existing .wsp storage files. We can check it with whisper-info.py.

whisper-info.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp
 
 maxRetention: 63072000
 xFilesFactor: 0.5
 aggregationMethod: average
 fileSize: 2561812

Archive 0
 retention: 1800
 secondsPerPoint: 1
 points: 1800
 size: 21600
 offset: 52
Archive 1
 retention: 86400
 secondsPerPoint: 60
 points: 1440
 size: 17280
 offset: 21652
Archive 2
 retention: 63072000
 secondsPerPoint: 300
 points: 210240
 size: 2522880
 offset: 38932

See, we still have xFilesFactor: 0.5.
If you don’t care about previous data, a simple solution is to delete the files so that they are recreated with the new parameters (rm -rf /opt/graphite/storage/whisper/collectd/). It’s a little bit overkill, but easy and fast.

The other solution is to use whisper-resize.py to enforce the new configuration:
whisper-resize.py /opt/graphite/storage/whisper/collectd/test-java01/cpu-0/cpu-user.wsp 3s:30m 1m:1d 5m:2y --xFilesFactor=0.1

The command above works fine, but there is another way to express how many points Graphite keeps: the n:i format, which means we store one measure every n seconds and keep i points (i is the retention duration in seconds divided by n).

Example: 3s:30m
30m = 1800s
1800 / 3 = 600

3:600

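So 3s:30m,1m:1d,5m:2y gives us 3:600 60:1440 300:210240. Whisper counts a year as 365 days, which matches the 210240 points reported by whisper-info.py above; with the average Gregorian year quoted below, the last archive would come out at roughly 210380 points.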

« An average Gregorian year is 365.2425 days = 52.1775 weeks = 8765.82 hours = 525949.2 minutes = 31556952 seconds (mean solar, not SI). » Wikipedia
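The conversion above is easy to script. The following is a minimal sketch, not Whisper's own parser: the helper names to_seconds and parse_retentions are made up for illustration, and it assumes a year of 365 days, which is what the whisper-info.py output shown earlier reports (a retention of 63072000 seconds for 2y).

```python
import re

# Unit lengths in seconds; a year is assumed to be 365 days.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def to_seconds(value):
    """Convert a duration such as '30m' or '2y' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * UNITS[match.group(2)]

def parse_retentions(definition):
    """Turn '3s:30m,1m:1d' into a list of (seconds_per_point, points)."""
    archives = []
    for part in definition.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives

print(parse_retentions("3s:30m,1m:1d,5m:2y"))
# [(3, 600), (60, 1440), (300, 210240)]
```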

Note

One thing to remember about storage-schemas.conf (taken from the Graphite doc):

« Changing this file will not affect already-created .wsp files. Use whisper-resize.py to change those. »
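Since every existing .wsp file keeps its own settings, each file has to be resized individually. Here is a hedged sketch that only builds the whisper-resize.py command lines for a whole directory, without running them. The function name resize_commands is hypothetical, and the retentions are passed as separate arguments, following the usage line of whisper-resize.py (path, then one or more timePerPoint:timeToStore).

```python
import os

def resize_commands(root, retentions, x_files_factor):
    """Yield one whisper-resize.py argument list per .wsp file under root."""
    for directory, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if name.endswith(".wsp"):
                path = os.path.join(directory, name)
                yield ["whisper-resize.py", path, *retentions,
                       f"--xFilesFactor={x_files_factor}"]

# Print the commands for review; nothing is executed here.
for command in resize_commands("/opt/graphite/storage/whisper/collectd",
                               ["3s:30m", "1m:1d", "5m:2y"], 0.1):
    print(" ".join(command))
```

Printing first makes it possible to review the list before handing each command to subprocess.run or a shell.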

Graphite: Checking that data is received

A very handy command for checking what goes through a network interface on a given port. For example, to check that packets are sent from Riemann to Graphite locally (interface lo, port 2003):

ngrep port 2003 -d lo

Which gives us the following result:

interface: lo (127.0.0.0/255.0.0.0)
filter: (ip or ip6) and ( port 2003 )
#
T 127.0.0.1:33936 -> 127.0.0.1:2003 [AP]
 test 1.4 1401805746. 
##
T 127.0.0.1:33935 -> 127.0.0.1:2003 [AP]
 test 1.0 1401805747.

The messages reach Graphite and use the correct format, so there is no problem on the sending side.
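To generate such a message by hand, a metric can be sent with a few lines of Python. This is a minimal sketch, assuming carbon listens for the plaintext protocol on 127.0.0.1:2003 as in the capture above; the helper names format_metric and send_metric are made up for illustration.

```python
import socket
import time

def format_metric(path, value, timestamp=None):
    """Build one plaintext line: path, value, timestamp, then a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"

def send_metric(path, value, host="127.0.0.1", port=2003):
    """Open a TCP connection to carbon and send a single metric line."""
    with socket.create_connection((host, port), timeout=5) as connection:
        connection.sendall(format_metric(path, value).encode("ascii"))

# Same payload as the first packet in the ngrep capture.
print(format_metric("test", 1.4, 1401805746), end="")
# test 1.4 1401805746
```

Calling send_metric("test", 1.4) while ngrep is running should make a similar line appear in the capture.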