Monday, August 14, 2017

Graphite - a simple 2017 setup to use with Shinken monitoring (English version)


Easy Graphite Install & Configuration

Simple 2017 method (using Docker), connected to Shinken + the linux-ssh pack


We assume you have Shinken up and running, and that you now want to use Graphite.
The method also works to get Graphite up and running with another backend (just skip the Shinken/Nagios part).

Introduction (where the story is)

I've been using Shinken for quite a few years now, and I'm happy with it and with the interface (advantages of Thruk on Shinken compared to Thruk on Nagios: we can select many services at once and perform Ack / Recheck, ..., which is very handy, plus failover and load balancing with no hassle, plus many small things).
By default, Shinken has RRD graphs activated. The problem comes when a new metric arrives from a server: the RRD files become all blank, because they need to be migrated to a format that includes the new metric. The result is that you lose all your previous graphs with all their history. I think it has happened to everyone using RRD graphs at some point after a configuration change.
A solution to this may exist, but I haven't found one simple enough to implement in a reasonable amount of time. I chose to use the "RRD next gen", called Graphite.

What people call Graphite is actually a set of 3 things, which I'll describe briefly:
-CARBON, a daemon that listens on port 2003 and feeds the database
-WHISPER, a database made of *.wsp files, 1 for each metric
-GRAPHITE, a web interface that shows graphs from the data stored in that database. It's driven by Python, Cairo, Django ...
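
To make the Carbon part concrete: its plaintext listener on port 2003 expects one line per datapoint, in the form "metric.path value timestamp". Here is a minimal Python sketch, assuming Carbon is reachable on localhost:2003. The function names and the metric name in the usage note are made up for illustration.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one line of Carbon's plaintext protocol, terminated by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single datapoint to Carbon over TCP (host and port are assumptions)."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line
```

For example, send_metric("test.myserver.load", 0.42) should make a test/myserver/load.wsp file appear in the whisper storage directory once Carbon has processed the point.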

For installing it, I found many websites, documentation pages and tutorials, all a bit different, and I constantly ended up with problems like:
  • having a correct vHost for apache (+ not breaking installed sites + their modules)
  • having the correct Cairo version installed via PIP
  • Python version mismatch, expected '2.7.2+'
  • mod_wsgi.so: Cannot load mod_wsgi.so into server: libpython2.5.so.1.0: cannot open shared object file: No such file or directory
  • Target WSGI script cannot be loaded as Python module
  • graphite/carbon ImportError: No module named fields
  • Django: IntegrityError: column user_id is not unique
  • mod_wsgi fails when it is asked to read a file
  • ...
Note: I was using Debian 6 when I first tried, then Debian 7, with still some problems.

So, ok, I'm no Python expert, not even a basic user, but I found the Debian packaging + PIP versioning + dependencies problem just a bit too much for me.
I chose to use a ready-to-use Carbon-Whisper-Graphite docker container. Simple, and working well in seconds (ok, minutes the first time you use Docker).

Installation (where the tech is)

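On Debian 8.8, this works super well:
# install the prerequisites first (curl and add-apt-repository are used below)
apt-get install apt-transport-https ca-certificates curl python-software-properties
# add Docker's GPG key, then display its fingerprint to verify it
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
apt-key fingerprint 0EBFCD88
# add the Docker repository and install docker-ce
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable"
apt-get update && apt-get install docker-ce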

Now, you want to download + install the graphite docker container, in just 1 command:
docker run -d \
  --name graphite \
  --restart=always \
  -p 81:80 \
  -p 2003-2004:2003-2004 \
  -p 2023-2024:2023-2024 \
  -p 8125:8125/udp \
  -p 8126:8126 \
  hopsoft/graphite-statsd

On the first run, docker answers with the following while it downloads the image:
Unable to find image 'hopsoft/graphite-statsd:latest' locally
latest: Pulling from hopsoft/graphite-statsd
The only thing I changed was to map port 80 of the container to local port 81, as local port 80 is already in use.
You can fine-tune this by mapping local files to some of the configuration files in the container. I did not at this point.

2 minutes, and boom, it's up and running.
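
A quick way to confirm that the container really answers is to try a TCP connection on the published ports. This is a small Python sketch, assuming the port mapping of the docker run command above (81 for the web interface, 2003 and 2004 for Carbon); the function names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_graphite(host="localhost", ports=(81, 2003, 2004)):
    """Map each TCP port published by the container to True or False."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling check_graphite() on the docker host should return True for each port once the container has finished starting. The UDP port 8125 cannot be checked this way.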

To "manage" the container, just perform these actions:
docker start graphite
docker restart graphite
docker stop graphite
docker exec -it graphite bash     # gives access to the container as if you were in it (bash session)
docker inspect graphite | grep Source -A 1   # gives you the local paths of some of the container's files


Shinken Configuration (text + tech)

You can find it easily on the Shinken Read the Docs website (see Links).
You just add the graphite module to /etc/shinken/brokers/broker-master.cfg (it will then use the config file /etc/shinken/modules/graphite.cfg and send data to Carbon).

/etc/shinken/brokers/broker-master.cfg contains:
...
modules     webui,graphite,livestatus,Syslog
...
/etc/shinken/modules/graphite.cfg =
define module {
module_name     graphite
module_type     graphite_perfdata
host            localhost
port            2003  ; Or 2004 if using use_pickle 1
}

That's it.
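
About the "2004 if using use_pickle" comment: port 2004 is Carbon's pickle receiver, which takes batches of datapoints instead of one text line per point. The message is a pickled list of (path, (timestamp, value)) tuples, prefixed with its length as a 4-byte big-endian integer. The Python sketch below only illustrates that framing; it is not the code of the Shinken module, and the function name is made up.

```python
import pickle
import struct


def build_pickle_message(datapoints):
    """Frame a batch for Carbon's pickle receiver.

    datapoints is a list of (path, (timestamp, value)) tuples.
    """
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length prefix
    return header + payload
```

The resulting bytes would be written to a TCP socket connected to port 2004, in the same way a text line is written to port 2003.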

You also need to declare the graphite-ui module in webui.cfg if you use the standard Shinken interface (I don't, I use Thruk).
So for graph links, I use an action_url in my /etc/shinken/templates/generic-host.cfg and in my /etc/shinken/templates/generic-service.cfg (this displays a link in Thruk to access the Graphite data, but does not embed the graph itself).

grep action /etc/shinken/templates/generic-host.cfg
action_url                      http://my.server.name:81/render?from=-36hours&until=now&width=800&height=450&target=$HOSTNAME$.rta&lineMode=connected&lineWidth=2&tz=Europe/Paris

grep action /etc/shinken/templates/generic-service.cfg
action_url                      http://my.server.name:81/render?from=-36hours&until=now&width=800&height=450&target=$HOSTNAME$.$SERVICEDESC$.*&lineMode=connected&lineWidth=2&title=$HOSTNAME$.$SERVICEDESC$&tz=Europe/Paris
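
To check outside Thruk what such a link looks like once the macros are expanded, the same render URL can be rebuilt by hand. This is a Python sketch using the parameters of the action_url above; the function name and the host and service values in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def render_url(base, target, title=None, **extra):
    """Build a Graphite /render URL with the same options as the action_url."""
    params = {
        "from": "-36hours",
        "until": "now",
        "width": 800,
        "height": 450,
        "target": target,
        "lineMode": "connected",
        "lineWidth": 2,
        "tz": "Europe/Paris",
    }
    if title:
        params["title"] = title
    params.update(extra)  # lets the caller override or add render options
    # Keep "*" and "/" readable instead of percent-encoding them.
    return base.rstrip("/") + "/render?" + urlencode(params, safe="*/")
```

For instance, render_url("http://my.server.name:81", "myhost.Disks.*", title="myhost.Disks") gives a URL that can be pasted into a browser to see the graph for every metric of that service.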

Graphite Configuration (text + tech)

I'm configuring only the whisper datafiles part, to fit what we need in our Shinken configuration.
We perform a check every 15 minutes for standard servers, and every 5 minutes for production ones.
We use the linux-ssh Shinken pack + some other commands to check specific ports or services.
When I first tried to configure the whisper datafiles, I ended up saturating my disk space when the datafiles were created (a whisper datafile has a fixed size, set at creation), so I tuned the settings to get a correct size.

Retention:

How long we keep data, and at which precision.


My config to fit my shinken:
root@xxxxxxxxxxxxx :/opt/graphite/conf# grep -v ^$ storage-schemas.conf | grep -v ^#
[default_cpu]
pattern = .*\.cpu.*
retentions = 5m:14d,30m:84d
# archive 0 has 12 * 24 * 14 = 4032 points
# archive 1 has  2 * 24 * 84 = 4032 points
#       total   8064    96KB
[default_stats]
pattern = .*\..*State*s\..*
retentions = 5m:14d,30m:224d
# archive 0 has 12 * 24 *  14 = 4032 points
# archive 1 has  2 * 24 * 224 = 10752 points
#       total   14784   176KB
[default_reboot]
pattern = .*\.Reboot\..*
retentions = 15m:14d,90m:224d,360m:2240d
# archive 0 has 4 * 24 * 14  = 1344 points
# archive 1 has 16 * 224  = 3584 points
# archive 2 has 4 * 2240  = 8960 points
#       total   13888   164KB
[default]
pattern = .*
retentions = 5m:14d,30m:224d,90m:896d,360m:2240d
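
The point counts in the comments above can be recomputed for any retentions line: each archive holds (duration / precision) points, and each point takes 12 bytes on disk. The Python sketch below does that arithmetic. It assumes the whisper on-disk layout of a 16-byte file header plus 12 bytes of header per archive, and it only handles the "precision:duration" form with a unit suffix, as used in this config.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}
METADATA_SIZE = 16      # file header
ARCHIVE_INFO_SIZE = 12  # header entry for each archive
POINT_SIZE = 12         # 4-byte timestamp + 8-byte value


def to_seconds(text):
    """Convert a string such as '5m' or '14d' to seconds."""
    match = re.fullmatch(r"(\d+)([smhdwy])", text.strip())
    if not match:
        raise ValueError("unsupported time spec: %r" % text)
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def archive_points(retentions):
    """Return the number of points of each archive in a retentions line."""
    points = []
    for archive in retentions.split(","):
        precision, duration = archive.split(":")
        points.append(to_seconds(duration) // to_seconds(precision))
    return points


def whisper_file_size(retentions):
    """Return the size in bytes of a whisper file created with these retentions."""
    points = archive_points(retentions)
    return METADATA_SIZE + ARCHIVE_INFO_SIZE * len(points) + POINT_SIZE * sum(points)
```

With the [default] line, archive_points gives 4032, 10752, 14336 and 8960 points, and whisper_file_size gives 457024 bytes per metric. Multiplying that by the number of metrics gives a fair idea of the disk space needed before Carbon creates the files.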


Aggregation:

When data gets old, how do we 'compress / keep' it at a lower resolution, to save space.

xFilesFactor: tells the daemon the minimum fraction of non-null points (value from 0 to 1) required in a slot. If we have less than this value, the corresponding point in the lower resolution (next archive) will be null too.

aggregationMethod: how to combine several non-null points into one point of the next lower resolution.
average is a good choice for me, but we can choose to keep the maximum, the minimum, or other fun possibilities (see the Graphite doc for that).
You can use a different aggregation method per metric (again with regex pattern matching).


My config to fit my shinken:

[default_average]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average
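
To see what these two settings do together, here is a simplified Python sketch of how one lower-resolution point is derived from the higher-resolution points it covers. It mirrors the documented behaviour rather than whisper's actual code, and the function name is made up. Empty slots are represented by None.

```python
def aggregate(values, x_files_factor=0.0, method="average"):
    """Return the aggregated point for one lower-resolution slot, or None.

    values holds the higher-resolution points covered by the slot,
    with None for slots that received no data.
    """
    known = [v for v in values if v is not None]
    if not known:
        return None  # nothing to aggregate, whatever the xFilesFactor
    if len(known) / len(values) < x_files_factor:
        return None  # not enough known points: the aggregated point stays null
    if method == "average":
        return sum(known) / len(known)
    if method == "sum":
        return sum(known)
    if method == "last":
        return known[-1]
    if method == "max":
        return max(known)
    if method == "min":
        return min(known)
    raise ValueError("unknown aggregation method: %r" % method)
```

This shows why xFilesFactor = 0.0 matters here: a server checked every 15 minutes fills only 2 of the 6 five-minute slots that make up a 30-minute point. With an xFilesFactor of 0.5 that point would be null, while with 0.0 it is the average of the 2 known values.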

Note: as whisper datafiles are created with a fixed size when a metric is first inserted, I deleted ALL metrics once the configuration above was correct (I could have used whisper-resize.py, but that is too difficult with many different sizes of database, and there was not much data to save). In case you need it, you can loop like this:
find /opt/graphite/storage/whisper/ -name '*.wsp' -type f | while read -r WSP; do
  whisper-resize.py --xFilesFactor 0.0 --aggregationMethod=average "$WSP" \
    5m:14d 30m:224d 60m:2240d > /dev/null
done

Troubleshoot:


Using whisper-dump.py:

root@xxxxxxxxxxxxx:/# whisper-dump.py /opt/graphite/storage/whisper/XXXXX/Disks/__data_used_.wsp | head -50


Meta data:
  aggregation method: average
  max retention: 193536000
  xFilesFactor: 0

Archive 0 info:
  offset: 64
  seconds per point: 300
  points: 4032
  retention: 1209600
  size: 48384

Archive 1 info:
  offset: 48448
  seconds per point: 1800
  points: 10752
  retention: 19353600
  size: 129024

Archive 2 info:
  offset: 177472
  seconds per point: 5400
  points: 14336
  retention: 77414400
  size: 172032

Archive 3 info:
  offset: 349504
  seconds per point: 21600
  points: 8960
  retention: 193536000
  size: 107520

Archive 0 data:
0: 1501174200, 194.87899999999999067767930682748556
1: 0,          0
2: 0,          0
3: 1501175100, 194.87899999999999067767930682748556
4: 0,          0
5: 0,          0
6: 1501176000, 194.87899999999999067767930682748556
7: 0,          0
8: 0,          0
9: 1501176900, 194.87899999999999067767930682748556
10: 0,          0
11: 0,          0
12: 1501177800, 194.87899999999999067767930682748556
13: 0,          0
14: 0,          0
15: 1501178700, 194.87899999999999067767930682748556

We can see that only 1/3 of the slots are filled, because this metric is recorded for a non-production server, so every 15 minutes instead of every 5.
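
On a long dump, counting the filled slots by eye is tedious. The Python sketch below computes the ratio from the "index: timestamp, value" lines of one archive's data section, treating a timestamp of 0 as an empty slot as in the dump above. The function name is hypothetical, and the lines of a single archive must be passed in, otherwise the archives get mixed.

```python
import re

# Matches data lines such as "3: 1501175100, 194.879" and "4: 0,          0".
POINT_LINE = re.compile(r"^(\d+):\s*(\d+),\s*(\S+)$")


def fill_ratio(dump_lines):
    """Return the fraction of slots with a non-zero timestamp."""
    total = 0
    filled = 0
    for line in dump_lines:
        match = POINT_LINE.match(line.strip())
        if not match:
            continue  # skip headers, blank lines and archive info lines
        total += 1
        if int(match.group(2)) != 0:
            filled += 1
    return filled / total if total else 0.0
```

A ratio close to 0.33 on the first archive is expected for a metric checked every 15 minutes, and close to 1.0 for one checked every 5 minutes.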

If you need to check what is really stored in your whisper files, just use these Python scripts (they are available directly in the container, so run them after the docker exec -ti graphite bash command):

whisper-info.py    XXXXX/Disks/___used_.wsp
whisper-dump.py XXXXX/Disks/___used_.wsp > tmp.tmp
less tmp.tmp 



Links:

http://shinken.readthedocs.io/en/latest/index.html
http://graphite.readthedocs.io/en/latest/index.html
https://github.com/hopsoft/docker-graphite-statsd