Monday, August 14, 2017

Graphite - a 2017 post on a simple setup to use with Shinken monitoring (English version)


Easy Graphite Install & Configuration

A simple 2017 method (using Docker), connected to Shinken + the linux-ssh pack


We assume you have Shinken up and running, and that you now want to use Graphite.
This method can actually also get Graphite up and running with another backend (just skip the Shinken/Nagios part).

Introduction (where the story is)

I have been using Shinken for quite a few years now and I'm happy with it, especially the interface (Thruk on Shinken vs. Thruk on Nagios: you can select many services at once and perform Ack / Recheck / ... very handy; plus failover and load balancing with no hassle, and many other small things).
By default, Shinken has RRD graphs activated. The problem comes when a new metric arrives from a server and the RRD files all go blank, because they need to be migrated to a format that includes the new metric, or its new format... the result is that you lose all your previous graphs and their history. I think this has happened to everyone using RRD graphs at some point after a configuration change.
Solutions to this may exist, but I haven't found one simple enough to be worth a reasonable amount of time. I chose to use the "next-gen RRD", called Graphite.

What people call Graphite is actually a set of 3 components, described shortly:
-CARBON, a daemon that listens on port 2003 and feeds the database
-WHISPER, a database composed of *.wsp files, one per metric
-GRAPHITE, a web interface showing graphs stored in that database. It's driven by Python, Cairo, Django...
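As a quick illustration, Carbon's plaintext protocol on port 2003 takes one datapoint per line, in the form "metric.path value unix_timestamp". A minimal sketch (the metric name is made up; the nc line assumes the Graphite container from the next section is running, so it is left commented out):

```shell
# Build one datapoint in Carbon's plaintext format: "metric.path value timestamp"
# (test.blog.demo is a hypothetical metric name)
TS=$(date +%s)
LINE="test.blog.demo 42 ${TS}"
echo "$LINE"
# To actually feed Carbon (container running, port 2003 mapped locally):
# echo "$LINE" | nc -q0 localhost 2003
```

Whisper then creates test/blog/demo.wsp on the first datapoint it receives.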

For installing it, I found many websites, documentations and tutorials, all a bit different, and I constantly ended up with problems like:
  • having a correct vHost for apache (+ not breaking installed sites + their modules) 
  • having the correct Cairo version installed via pip
  • Python version mismatch, expected '2.7.2+'
  • mod_wsgi.so: Cannot load mod_wsgi.so into server: libpython2.5.so.1.0: cannot open shared object file: No such file or directory
  • Target WSGI script cannot be loaded as Python module
  • graphite/carbon ImportError: No module named fields
  • Django: IntegrityError: column user_id is not unique
  • mod_wsgi fails when it is asked to read a file
  • ...
Note: I was using Debian 6 when I first tried, then Debian 7, with still some problems.

So, OK, I'm no Python expert, not even a basic user, but I found the Debian packaging + pip versioning + dependency problems just a bit too much for me.
I chose a ready-to-use Carbon-Whisper-Graphite Docker container instead. Simple, and working well in seconds (OK, minutes the first time you use Docker).

Installation (where the tech is)

On Debian 8.8, this works super well:
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
apt-key fingerprint 0EBFCD88
apt-get install apt-transport-https ca-certificates curl python-software-properties
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian \
$(lsb_release -cs) stable"
apt-get update && apt-get install docker-ce

Now, you want to download + install the Graphite Docker container, in just one command:
docker run -d \
  --name graphite \
  --restart=always \
  -p 81:80 \
  -p 2003-2004:2003-2004 \
  -p 2023-2024:2023-2024 \
  -p 8125:8125/udp \
  -p 8126:8126 \
  hopsoft/graphite-statsd
Unable to find image 'hopsoft/graphite-statsd:latest' locally
latest: Pulling from hopsoft/graphite-statsd
(The two lines above are Docker's output when it pulls the image for the first time.)
The only thing I changed is mapping the container's port 80 to local port 81, as local port 80 is already in use.
You can fine-tune this by linking local files to some of the container's configuration files. I did not at this point.

2 minutes, and boom, it's up and running.

To "manage" the container, just perform these actions:
docker start graphite
docker restart graphite
docker stop graphite
docker exec -it graphite bash     # gives you a bash session inside the container
docker inspect graphite | grep Source -A 1   # gives you the local paths of some container files


Shinken Configuration (text + tech)

You can find this easily on the Shinken Read the Docs website.
You just add the graphite module to /etc/shinken/brokers/broker-master.cfg (it will then use the config file /etc/shinken/modules/graphite.cfg and send data to Carbon).

/etc/shinken/brokers/broker-master.cfg contains:
...
modules     webui,graphite,livestatus,Syslog
...
/etc/shinken/modules/graphite.cfg =
define module {
module_name     graphite
module_type     graphite_perfdata
host            localhost
port            2003  ; Or 2004 if using use_pickle 1
}

That's it.

You also declare the graphite-ui module in webui.cfg if you use the standard Shinken interface (I don't; I use Thruk).
So for graph links, I use an action_url in my /etc/shinken/templates/generic-host.cfg and in my /etc/shinken/templates/generic-service.cfg (it displays a link from Thruk to the Graphite data, but does not embed the graph itself).

grep action /etc/shinken/templates/generic-host.cfg
action_url                      http://my.server.name:81/render?from=-36hours&until=now&width=800&height=450&target=$HOSTNAME$.rta&lineMode=connected&lineWidth=2&tz=Europe/Paris

grep action /etc/shinken/templates/generic-service.cfg
action_url                      http://my.server.name:81/render?from=-36hours&until=now&width=800&height=450&target=$HOSTNAME$.$SERVICEDESC$.*&lineMode=connected&lineWidth=2&title=$HOSTNAME$.$SERVICEDESC$&tz=Europe/Paris

Graphite Configuration (text + tech)

I'm only configuring the Whisper database datafiles part, to fit what we need in our Shinken configuration.
So, we perform a check every 15 minutes for standard servers, and every 5 minutes for production ones.
We use the linux-ssh Shinken pack + some other commands to check specific ports or services.
When I first tried to configure the Whisper datafiles, I ended up saturating my disk space as datafiles were created (a Whisper datafile has a fixed size, set at creation), so I tuned the settings to get a sensible size.

Retention:

How long we keep data, and at what precision.


My config to fit my shinken:
root@xxxxxxxxxxxxx :/opt/graphite/conf# grep -v ^$ storage-schemas.conf | grep -v ^#
[default_cpu]
pattern = .*\.cpu.*
retentions = 5m:14d,30m:84d
# archive 0 has 12 * 24 * 14 = 4032 points
# archive 1 has  2 * 24 * 84 = 4032 points
#       total   8064    96KB
[default_stats]
pattern = .*\..*State*s\..*
retentions = 5m:14d,30m:224d
# archive 0 has 12 * 24 *  14 = 4032 points
# archive 1 has  2 * 24 * 224 = 10752 points
#       total   14784   176KB
[default_reboot]
pattern = .*\.Reboot\..*
retentions = 15m:14d,90m:224d,360m:2240d
# archive 0 has 4 * 24 * 14  = 1344 points
# archive 1 has 16 * 224  = 3584 points
# archive 2 has 4 * 2240  = 8960 points
#       total   13888   ~163KB
[default]
pattern = .*
retentions = 5m:14d,30m:224d,90m:896d,360m:2240d
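The point counts in the comments above can be reproduced with simple shell arithmetic; Whisper stores 12 bytes per point (plus a small header), which is where the file sizes come from. A sketch for the [default_cpu] schema:

```shell
# [default_cpu] retentions = 5m:14d,30m:84d
a0=$(( 14 * 86400 / 300 ))   # 5-minute points kept for 14 days
a1=$(( 84 * 86400 / 1800 ))  # 30-minute points kept for 84 days
total=$(( a0 + a1 ))
bytes=$(( total * 12 ))      # Whisper stores 12 bytes per point
echo "$a0 + $a1 = $total points, ~$bytes bytes"
```

Running the same arithmetic per schema is a quick sanity check before the fixed-size datafiles eat your disk.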


Aggregation:

When data gets old, how do we 'compress' it to a lower resolution, to save space?

xFilesFactor: tells the daemon the minimum fraction of non-null data (a value from 0 to 1) required. If we have less than this, the corresponding point in the lower resolution (next archive) will be null too.

aggregationMethod: how several non-null points are combined into one point of the next lower resolution.
Average is a good choice for me, but you can choose to keep the maximum, the minimum, or other fun possibilities (see the Graphite doc for that).
You can use a different aggregation method per metric (again with regex pattern matching).
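To make xFilesFactor concrete, here is a small sketch with made-up numbers: aggregating six 5-minute points into one 30-minute point, with a hypothetical xFilesFactor of 0.5, requires at least 3 non-null source points:

```shell
slots=6       # six 5-minute points feed one 30-minute point
known=2       # only two of them are non-null (hypothetical)
# compare known/slots against xFilesFactor 0.5 using integer math
if [ $(( known * 2 )) -ge "$slots" ]; then
    result="aggregated"
else
    result="null (below xFilesFactor)"
fi
echo "$result"
```

With xFilesFactor = 0.0, as in my config below, a single known point is enough to produce an aggregated value.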


My config to fit my Shinken (storage-aggregation.conf):

[default_average]
pattern = .*
xFilesFactor = 0.0
aggregationMethod = average

Note: as Whisper datafiles are created with a fixed size when the metric is first inserted, I deleted ALL metrics after getting the configuration above right (I could have used whisper-resize.py, but that was too painful with many different database sizes, and not much data worth saving). In case you need it, you can loop like this:
for WSP in $(find /opt/graphite/storage/whisper/ -name '*.wsp' -type f); do
  whisper-resize.py --xFilesFactor=0.0 --aggregationMethod=average "$WSP" \
    5m:14d 30m:224d 60m:2240d > /dev/null
done

Troubleshooting:


Using whisper-dump.py:

root@xxxxxxxxxxxxx:/# whisper-dump.py /opt/graphite/storage/whisper/XXXXX/Disks/__data_used_.wsp | head -50


Meta data:
  aggregation method: average
  max retention: 193536000
  xFilesFactor: 0

Archive 0 info:
  offset: 64
  seconds per point: 300
  points: 4032
  retention: 1209600
  size: 48384

Archive 1 info:
  offset: 48448
  seconds per point: 1800
  points: 10752
  retention: 19353600
  size: 129024

Archive 2 info:
  offset: 177472
  seconds per point: 5400
  points: 14336
  retention: 77414400
  size: 172032

Archive 3 info:
  offset: 349504
  seconds per point: 21600
  points: 8960
  retention: 193536000
  size: 107520

Archive 0 data:
0: 1501174200, 194.87899999999999067767930682748556
1: 0,          0
2: 0,          0
3: 1501175100, 194.87899999999999067767930682748556
4: 0,          0
5: 0,          0
6: 1501176000, 194.87899999999999067767930682748556
7: 0,          0
8: 0,          0
9: 1501176900, 194.87899999999999067767930682748556
10: 0,          0
11: 0,          0
12: 1501177800, 194.87899999999999067767930682748556
13: 0,          0
14: 0,          0
15: 1501178700, 194.87899999999999067767930682748556

We can see that only one slot in three is filled, because this metric is recorded for a non-production server, so every 15 minutes instead of every 5.
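The 1/3 fill ratio follows directly from the check interval versus the archive resolution (a quick sketch with the numbers above):

```shell
check_interval=900  # standard servers are checked every 15 minutes
sec_per_point=300   # archive 0 stores one point per 5 minutes
ratio=$(( check_interval / sec_per_point ))
echo "1 slot filled out of every $ratio"
```

This is also a hint for matching your retention precision to your check interval if the sparse archives bother you.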

If you need to check what is really stored in your Whisper files, just use these Python scripts (available directly in your container, i.e. after the docker exec -it graphite bash command):

whisper-info.py    XXXXX/Disks/___used_.wsp
whisper-dump.py XXXXX/Disks/___used_.wsp > tmp.tmp
less tmp.tmp 



Links:

http://shinken.readthedocs.io/en/latest/index.html
http://graphite.readthedocs.io/en/latest/index.html
https://github.com/hopsoft/docker-graphite-statsd




Wednesday, November 23, 2016

Oracle 12c Pre install script... almost perfect !



When installing Oracle 12c on an Oracle Linux server, you can perform most of the prerequisite settings by using a cool script:

yum install oracle-rdbms-server-12cR1-preinstall


The problem is that the script fails when started (really, on an out-of-the-box server freshly installed from Oracle repos):
oracle-rdbms-server-12cR1-preinstall-verify
/bin/sed: -e expression #1, char 116: unknown command: `3'
/bin/sed: -e expression #1, char 116: unknown command: `3'

The error above can be fixed with the following change (use ' instead of " on line 570):

cp /usr/bin/oracle-rdbms-server-12cR1-preinstall-verify /usr/bin/oracle-rdbms-server-12cR1-preinstall-verify.bak

vi /usr/bin/oracle-rdbms-server-12cR1-preinstall-verify

diff /usr/bin/oracle-rdbms-server-12cR1-preinstall-verify /usr/bin/oracle-rdbms-server-12cR1-preinstall-verify.bak
<             ${SED} -i /'^#[[:space:]]*$COMMENT'/d ${LIMITSFILE}
---
>             ${SED} -i /"^#[[:space:]]*$COMMENT"/d ${LIMITSFILE}
570c570
<           ${SED} -i /'^#[[:space:]]*$COMMENT'/d ${LIMITSFILE}
---
>           ${SED} -i /"^#[[:space:]]*$COMMENT"/d ${LIMITSFILE}


Then you can use the script, which saves you time setting the many parameters needed for the prerequisites.


Red Hat RHEL7 - NFS service is down - reboot needed




Problem:
NFS service is down

Logs: 
[16-11-23-10:06:00] [root@servername - ~]
cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)


[16-11-23-10:06:02] [root@servername - ~]
systemctl status nfs-server.service
● nfs-server.service - NFS server and services
   Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Wed 2016-11-23 10:06:02 CET; 7s ago
  Process: 11699 ExecStart=/usr/sbin/rpc.nfsd $RPCNFSDARGS (code=exited, status=1/FAILURE)
  Process: 11697 ExecStartPre=/usr/sbin/exportfs -r (code=exited, status=0/SUCCESS)
 Main PID: 11699 (code=exited, status=1/FAILURE)

Nov 23 10:06:02 servernamesystemd[1]: Starting NFS server and services...
Nov 23 10:06:02 sservername rpc.nfsd[11699]: rpc.nfsd: writing fd to kernel failed: errno 111 (Connection refused)
Nov 23 10:06:02 servername rpc.nfsd[11699]: rpc.nfsd: unable to set any sockets for nfsd
Nov 23 10:06:02 servername systemd[1]: nfs-server.service: main process exited, code=exited, status=1/FAILURE
Nov 23 10:06:02 servername systemd[1]: Failed to start NFS server and services.
Nov 23 10:06:02 servernamesystemd[1]: Unit nfs-server.service entered failed state.
Nov 23 10:06:02 servernamesystemd[1]: nfs-server.service failed.


So checking the rpcbind service:

[16-11-23-10:07:18] [root@servername - ~]
systemctl status rpcbind.service start
● rpcbind.service - RPC bind service
   Loaded: loaded (/usr/lib/systemd/system/rpcbind.service; indirect; vendor preset: enabled)
   Active: failed (Result: start-limit) since Wed 2016-11-23 05:31:58 CET; 4h 35min ago
 Main PID: 1278 (code=exited, status=0/SUCCESS)

Nov 23 05:31:58 servername systemd[1]: Starting RPC bind service...
Nov 23 05:31:58 servername rpcbind[20677]: /sbin/rpcbind: symbol lookup error: /sbin/rpcbind: undefined symbol: libtirpc_set_debug
Nov 23 05:31:58 servername systemd[1]: rpcbind.service: control process exited, code=exited status=127
Nov 23 05:31:58 servername systemd[1]: Failed to start RPC bind service.
Nov 23 05:31:58 servername systemd[1]: Unit rpcbind.service entered failed state.
Nov 23 05:31:58 servername systemd[1]: rpcbind.service failed.
Nov 23 05:31:58 servername systemd[1]: start request repeated too quickly for rpcbind.service
Nov 23 05:31:58 servername systemd[1]: Failed to start RPC bind service.
Nov 23 05:31:58 servername systemd[1]: rpcbind.service failed.



Solution:
Actually, quite a big set of updates had been applied from Red Hat Satellite, and the server needed a reboot (even if that was not advertised in the Satellite update summary).
shutdown -r now
As I did not find any related information for the errors above, I quickly wrote this :)
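For the record, here is how I would now check whether a reboot is pending after such an update run. This is a hedged sketch, not something I ran at the time; it assumes yum-utils is installed, which provides the needs-restarting helper on RHEL 7:

```shell
# needs-restarting -r (from yum-utils) exits non-zero when a reboot is needed
if command -v needs-restarting >/dev/null 2>&1; then
    if needs-restarting -r >/dev/null 2>&1; then
        msg="no reboot needed"
    else
        msg="reboot required"
    fi
else
    msg="needs-restarting not available (yum install yum-utils)"
fi
echo "$msg"
```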


Friday, February 12, 2016

Visual Studio 2015 - and GNU Make : Path is truncated, while PATH is complete!


Quick view:

Not sure why, but I have 2 path variables: Path and PATH.
PATH is complete and shows no problem.
Path is truncated after ...Common7\IDE because of the trailing \


Details:
Just a quick note for anyone who encounters this problem with their .bat build files after moving from Visual Studio 2003, 2005, 2008 or 2010 to Visual Studio 2015.

I use a bat script with these 2 lines:
call "%VS140COMNTOOLS%\vsvars32.bat"
make bin

I call GNU Make, a Windows binary version of it (installed by Unix For Windows: C:\Program Files\unixforwindows\usr\local\wbin\make.exe).

It shows this environment from the Windows command "set" I've put in the Makefile:

Path=C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86
)\MSBuild\14.0\bin;C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE(STOPS HERE)

PATH=C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\CommonExtensions\Microsoft\TestWindow;C:\Program Files (x86
)\MSBuild\14.0\bin;C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\IDE\;C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin;C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\Tools;C:\Windows\Microsoft.NET\Framework\v4.0.30319;C
:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\VCPackages;etc...


You can see the Path (used by GNU Make) is truncated. PATH (used by other windows tools) is ok.

The problem is this line in %VS140COMNTOOLS%\vsvars32.bat:
@set DevEnvDir=%VSINSTALLDIR%Common7\IDE\


I needed to remove the trailing \ :
@set DevEnvDir=%VSINSTALLDIR%Common7\IDE



Visual Studio 2015 - Win32 release is now wrong / Release changed from Win32 to x86




When this used to work just fine with Microsoft Visual Studio 2003, 2005, 2008 or 2010:
call "%VS90COMNTOOLS%\vsvars32.bat"
devenv xxx.vcxproj /Build "Release|win32"
devenv xxx.vcxproj /Build "Release|x64"

I needed to have this in my script with Microsoft Visual Studio 2015:
call "%VS140COMNTOOLS%\vsvars32.bat"
devenv xxx.vcxproj /Build "Release|x86"
devenv xxx.vcxproj /Build "Release|x64"


It's a quick post; if you have questions, please post a comment!

Tuesday, December 16, 2014

Solaris 10: an LDAP user can login, but can't use SSH: You don't exist, go away!


As said in the title, I've encountered
Solaris 10: an LDAP user can login, but can't use SSH: You don't exist, go away!


Symptom: 
[14-12-03 - 14:09:54 on servername]  ~$ id -a
uid=1201 gid=1200 groups=1200,1300
[14-12-03 - 14:08:13 on servername]  ~$ ssh -vv
You don't exist, go away!

Cause:
users (not in the root or sys groups) can't read the LDAP client files under /var/ldap, so they can't access their information from the LDAP server

Solution:
as root:

 
[14-12-03 - 14:40:42 on servername] root /etc
$ chmod o+r /var/ldap/*
[14-12-03 - 14:42:05 on servername] root /etc
$ ll /var/ldap/*
-rw-r--r--   1 Template root         10K Dec  3 13:45 /var/ldap/cachemgr.log
-rw----r--   1 Template root         64K Aug  5  2013 /var/ldap/cert8.db
-rw----r--   1 Template root         32K Aug  5  2013 /var/ldap/key3.db
-r-----r--   1 Template root         222 Dec  3 13:38 /var/ldap/ldap_client_cred
-r-----r--   1 Template root         478 Dec  3 13:38 /var/ldap/ldap_client_file
-rw----r--   1 Template root         32K Aug  5  2013 /var/ldap/secmod.db

then the "id" command shows correct results coming from the LDAP server:
[14-12-03 - 14:14:15 on servername] alex ~
$ id -a
uid=1201(alex) gid=1200(admin) groups=1200(admin),1300(support)


It's a quick post, please leave your questions in the comments! 

Tuesday, April 8, 2014

Quick how to: extend a linux encrypted partition



You have a Linux virtual machine with an encrypted hard drive. How do you quickly extend it?


Environment (this method may apply to other environments):
  • Virtualization type: 
    • VMWare ESX
  • Linux:
    • RHEL 5.7
    • using main root partition as an LVM encrypted with LUKS
      / in /dev/sda2
    • using a boot partition /boot in /dev/sda1

Quick method:
  • give more space from VMWare
    •  edit settings / Hardware / Hard Disk / change the value in "provisioned size"
      (it is usually grayed out when you already have a snapshot)
    • create a snapshot (to roll back in case of problem)
  • boot a live CD
    • from VMWare, Edit Settings / CD DVD adapter / load ISO (I used a debian7 cd image) + connected at poweron
    • edit bios setting to boot on cd
    • Boot on virtual CD: for my Debian 7, I booted in ExpertMode / Rescue Disk 
  • From the live cd console
    • extend the physical partition
      fdisk /dev/sda
      the sequence is: d 2 n p 2 t 2 8e w

      (sequence meaning: delete partition 2, new primary partition, number 2, change type of partition 2 to LVM (8e), write)
    • Open CRYPT
      cryptsetup luksOpen /dev/sda2 crypt1
    • extend  CRYPT:
      cryptsetup resize crypt1
    • extend PV:
      pvdisplay /dev/mapper/crypt1
      pvresize /dev/mapper/crypt1
      pvdisplay /dev/mapper/crypt1
    • extend LV:
      lvdisplay
      lvresize -L +30G /dev/VolGroup00/LogVol00
      lvdisplay
  • reboot your server as usual
    • extend filesystem:
      resize2fs  -p /dev/mapper/VolGroup00-LogVol00
    • check new available size:
      df -h
That's it

Notes:
This is a quick (and dirty) how-to.
It does not cover good practices like writing random data over the disk space we merge into our LUKS partition. It uses a live CD to avoid locking and root-unmount problems.
I hope it will be useful to some readers! Comments welcome.