I am moving (again!!) to GitHub Pages

Hello folks, I am moving to GitHub Pages. I will continue to write there instead of here.

Please visit my new home: http://soumyadipdm.github.io/


CFEngine Internals: How does cf-agent execute a policy?

1. Start GDB
[root@infra01 unixguy]# gdb /var/cfengine/bin/cf-agent
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-64.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /var/cfengine/bin/cf-agent...done.

2. Show the entry point of the program
(gdb) list
207 "Enable colorized output. Possible values: 'always', 'auto', 'never'. If option is used, the default value is 'auto'",
208 "Disable extension loading (used while upgrading)",
209 "Log timestamps on each line of log output",
210 NULL
211 };
212
213 /*******************************************************************/
214
215 int main(int argc, char *argv[])
216 {

3. Put a breakpoint at the main() function
(gdb) break main
Breakpoint 1 at 0x409f30: file cf-agent.c, line 216.

4. Run the program with arguments

(gdb) run -IKC -f ./test.cf
Starting program: /var/cfengine/bin/cf-agent -IKC -f ./test.cf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, main (argc=4, argv=0x7fffffffe658) at cf-agent.c:216
216 {

5. Bird's-eye view of the overall execution of cf-agent:

(gdb) next
220 struct timespec start = BeginMeasure(); ---> BeginMeasure() is in libpromises/instrumentation.c; it uses clock_gettime() from time.h to record the start time of the execution, in seconds since the epoch
(gdb)
222 GenericAgentConfig *config = CheckOpts(argc, argv); ---> CheckOpts() parses cf-agent's command-line arguments
(gdb)
223 EvalContext *ctx = EvalContextNew(); ---> EvalContextNew() is defined in libpromises/eval_context.c. It initializes the EvalContext structure: it allocates memory for global classes (ClassTable structures in a red-black tree) and global variables, records the uid/gid/pid of the process, and so on. In other words, it holds the current execution context.
(gdb)
224 GenericAgentConfigApply(ctx, config); ---> sets the soft classes defined via the -D option; these classes are set with "default namespace" scope, i.e. they remain set until the agent exits.
(gdb)
223 EvalContext *ctx = EvalContextNew();
(gdb)
224 GenericAgentConfigApply(ctx, config);
(gdb)
226 GenericAgentDiscoverContext(ctx, config); ---> defined in
libpromises/generic_agent.c, it discovers hard classes using functions like
DetectEnvironment() defined in libenv/sysinfo.c
(gdb)
Detaching after fork from child process 15648.
228 Policy *policy = SelectAndLoadPolicy(config, ctx, ALWAYS_VALIDATE,
true); ---> defined in libpromises/generic_agent.c; checks the validity of the policies and, on success, writes the current timestamp to /var/cfengine/state/cf_promises_validated
(gdb)
Detaching after fork from child process 15650.
230 if (!policy)
(gdb)
228 Policy *policy = SelectAndLoadPolicy(config, ctx, ALWAYS_VALIDATE, true);
(gdb)
230 if (!policy)
(gdb)
236 GenericAgentPostLoadInit(ctx); ---> initializes a TLS client
(gdb)
237 ThisAgentInit(); ---> sets umask to 077
(gdb)
239 BeginAudit(); ---> sets END_AUDIT_REQUIRED to true
(gdb)
240 KeepPromises(ctx, policy, config); ---> evaluates control promises and then bundle promises; KeepPromiseBundles() evaluates the bundles one at a time
(gdb)
222 GenericAgentConfig *config = CheckOpts(argc, argv);
(gdb)
240 KeepPromises(ctx, policy, config);
(gdb)
R: this should not happen
242 if (ALLCLASSESREPORT) ---> if configured, writes the names of the classes from the ClassTable to allclasses.txt
(gdb)
247 Nova_TrackExecution(config->input_file);
(gdb)
250 UpdatePackagesCache(ctx, false); ---> updates the cache of installed packages
(gdb)
252 GenerateReports(config, ctx);
(gdb)
254 PurgeLocks();
(gdb)
255 BackupLockDatabase();
(gdb)
257 PolicyDestroy(policy); /* Can we safely do this earlier ? */
(gdb)
259 if (config->agent_specific.agent.bootstrap_policy_server && !VerifyBootstrap())
(gdb)
258 int ret = 0;
(gdb)
266 EndAudit(ctx, CFA_BACKGROUND);
(gdb)
268 Nova_NoteAgentExecutionPerformance(config->input_file, start);
(gdb)
270 GenericAgentFinalize(ctx, config); ---> destroys the TLS channel and frees data structures like config and the EvalContext
(gdb)
277 }
(gdb)
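
If you want to replay this whole walkthrough without typing each command, gdb's batch mode can script it; something like this should work:

# Run the same session non-interactively; --args passes cf-agent's options through
gdb --batch \
    -ex 'break main' \
    -ex 'run' \
    -ex 'next 30' \
    --args /var/cfengine/bin/cf-agent -IKC -f ./test.cf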


Who’s killing that process? Who’s dumping prelink files in /tmp? — Linux auditd to the rescue

The other day at work we had some interesting issues.

  1. People started complaining that the config management tool was killing application processes.
  2. A few days before that, people had complained about /tmp getting filled up by mysterious prelink files, and the blame was put on RHEL 6. Somebody said it was due to a bug (https://bugzilla.redhat.com/show_bug.cgi?id=584550). Hmm!! Somebody knows how to get onto StackExchange 🙂

Anyway, in both cases the app owners' explanations did not make sense: if it were a problem with the config management tool, or a bug in the OS, we would see the issue globally, not on just a few servers.

So how do you tackle/RCA such issues?

Well, Linux already has something awesome for us: the Linux audit subsystem can help a lot in dissecting issues like the ones above.

For issue 1, I enabled audit on the `kill` system call, filtering it further on SIGKILL and SIGTERM. You can do this by issuing the following commands:

auditctl -a always,exit -F arch=b64 -F a1=15 -S kill -k log_kill
auditctl -a always,exit -F arch=b64 -F a1=9 -S kill -k log_kill
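
Keep in mind that rules added with auditctl last only until the next reboot. To make them permanent, put the same rules (minus the auditctl prefix) into the audit rules file:

cat >>/etc/audit/audit.rules <<'EOF'
# persistent versions of the two rules above
# (on RHEL 7, drop these into a file under /etc/audit/rules.d/ instead)
-a always,exit -F arch=b64 -F a1=15 -S kill -k log_kill
-a always,exit -F arch=b64 -F a1=9 -S kill -k log_kill
EOF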

The auditctl(8) man page has a very good rundown of the arguments. Here's a short description of the options used above:

-a appends the rule after the existing ones. "always" means "always create an audit event". "exit" means the event is logged in the syscall exit list, i.e. when the audited system call returns.

-F is a filter; multiple -F options are combined with an implicit AND. Here I am looking at 64-bit system calls. a0 through a3 are the first four arguments of the system call, and I am only interested in the SIGTERM and SIGKILL signals. Looking at the kill(2) man page, I figured out that the second argument takes the signal name/number, so SIGTERM is a1=15. kill(2) also says the first argument is the PID, so if I really wanted to narrow things down, I could even have specified -F a0=12646, where 12646 is the PID of the process being audited.

-S takes a system call name or number; you can pass multiple if you want to audit more than one system call. /usr/include/asm/unistd_64.h defines all 320 system calls; refer to it if you do not already know the name or number. You have to install the "kernel-headers" package for your kernel version to get access to the header file.
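
For example, to find the number of the kill system call on x86_64:

grep -w __NR_kill /usr/include/asm/unistd_64.h
#define __NR_kill 62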

-k attaches a word that can be referred to as a key. It makes searching the logs super convenient, and you can even delete the rule later using the same key.

Awesome. Now that the audit rule was set up, all we had to do was wait for the next incident in which an application process got killed.

Here's a demo of what I did after the app got killed. To demonstrate, I am using a script that sleeps for 5 minutes and then wakes up only to sleep for another 5 minutes. I pushed it to the background and issued a kill -9 on its PID.
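
Something along these lines (a minimal sketch of the demo; the script name matches the ocomm= field in the audit log below):

cat >script_kill_me <<'EOF'
#!/bin/bash
# sleep in 5-minute naps, forever
while true; do
    sleep 300
done
EOF
chmod +x script_kill_me

./script_kill_me &    # push it to the background
kill -9 $!            # SIGKILL the backgrounded PID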


Now let's see what the audit log has to say about the event. Audit logs are written to /var/log/audit/audit.log in raw format. I used ausearch(8) to filter the results; plus, it gives really human-friendly output.

So I used the command below to check the logs:

ausearch -i -k log_kill

-i interprets uids, gids, syscalls, arguments, etc. into human-readable words by reading /etc/passwd and friends. It really helps; without this option, you get the numeric/hexadecimal equivalents.

-k filters the search by the key we created before.

And it showed me this:


----
type=OBJ_PID msg=audit(01/23/2016 21:09:50.413:174) : opid=12646 oauid=unixguy ouid=unixguy oses=3 ocomm=script_kill_me.
type=SYSCALL msg=audit(01/23/2016 21:09:50.413:174) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x3166 a1=SIGKILL a2=0x0 a3=0x3166 items=0 ppid=1338 pid=1339 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=bash exe=/usr/bin/bash key=log_kill

The log is pretty clear about which process sent SIGKILL (the exe= field) to the process being audited (the opid= field).

And boy, did we find out who was killing the app process!! It was certainly not the config management tool. 🙂

Now it's time for issue #2. Using the same auditd we can monitor files and directories. The mysterious prelink files are actually generated by the /usr/sbin/prelink program. As per the prelink(8) man page:

prelink is a program that modifies ELF shared libraries and ELF dynamically linked binaries in such a way that the time needed for the dynamic linker to perform relocations at startup significantly decreases. Due to fewer relocations, the run-time memory consumption decreases as well (especially the number of unshareable pages). The prelinking information is only used at startup time if none of the dependent libraries have changed since prelinking; otherwise programs are relocated normally.

So someone, or some program or script, was calling prelink to create the prelink files. If we had just enabled audit on the /tmp/*prelink* files, it would not have told us much: only that at some point /usr/sbin/prelink was called to create/write/change attributes of the prelink files. But I also wanted to know who executed /usr/sbin/prelink at the same time. That would give us solid proof of which devil was filling /tmp with junk.

So first let’s enable audit on the mysterious file:

auditctl -w /tmp/i_am_a_mysterious_file -p wa -k mysterious_file

"-p wa" says that I am only interested in "write" and "attribute change" events on the mysterious file.

And then enable audit on /usr/sbin/prelink; for the demo I am using /usr/bin/touch:

auditctl -w /usr/bin/touch -p x -k mysterious_touch

"-p x" here says that I am only interested in execution events for the file in question.

After a while, I executed the command below to see when the mysterious file was written to or had its attributes changed.

ausearch -i -k mysterious_file

----
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=2 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=1 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=0 name=/tmp/ inode=25165953 dev=fd:00 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 objtype=PARENT
type=CWD msg=audit(01/23/2016 23:18:06.577:418) : cwd=/home/unixguy
type=SYSCALL msg=audit(01/23/2016 23:18:06.577:418) : arch=x86_64 syscall=open success=yes exit=3 a0=0x7ffff6b47404 a1=O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK a2=0666 a3=0x7ffff6b45c80 items=3 ppid=16652 pid=16655 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_file
----
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=2 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=1 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=0 name=/tmp/ inode=25165953 dev=fd:00 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 objtype=PARENT
type=CWD msg=audit(01/23/2016 23:18:20.265:420) : cwd=/home/unixguy
type=SYSCALL msg=audit(01/23/2016 23:18:20.265:420) : arch=x86_64 syscall=open success=yes exit=3 a0=0x7fffb6fa2404 a1=O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK a2=0666 a3=0x7fffb6fa08a0 items=3 ppid=16652 pid=16658 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_file

From the logs, I took note of the start and end times of the operation, as well as the uid of the process.

Then I issued the command below, filtering on the timestamps I got above:

[unixguy@app01 ~]$ sudo ausearch -i -k mysterious_touch -ts 23:18:06 -te 23:18:21
----
type=PATH msg=audit(01/23/2016 23:18:06.576:417) : item=1 name=(null) inode=16879113 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.576:417) : item=0 name=/usr/bin/touch inode=8587749 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=CWD msg=audit(01/23/2016 23:18:06.576:417) : cwd=/home/unixguy
type=EXECVE msg=audit(01/23/2016 23:18:06.576:417) : argc=2 a0=/usr/bin/touch a1=/tmp/i_am_a_mysterious_file
type=SYSCALL msg=audit(01/23/2016 23:18:06.576:417) : arch=x86_64 syscall=execve success=yes exit=0 a0=0x1a92fd0 a1=0x1a96a10 a2=0x19852a0 a3=0x7fffb7234d50 items=2 ppid=16652 pid=16655 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_touch
----
type=PATH msg=audit(01/23/2016 23:18:20.264:419) : item=1 name=(null) inode=16879113 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.264:419) : item=0 name=/usr/bin/touch inode=8587749 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=CWD msg=audit(01/23/2016 23:18:20.264:419) : cwd=/home/unixguy
type=EXECVE msg=audit(01/23/2016 23:18:20.264:419) : argc=2 a0=/usr/bin/touch a1=/tmp/i_am_a_mysterious_file
type=SYSCALL msg=audit(01/23/2016 23:18:20.264:419) : arch=x86_64 syscall=execve success=yes exit=0 a0=0x1a96a10 a1=0x1a9c340 a2=0x19852a0 a3=0x7fffb7234d50 items=2 ppid=16652 pid=16658 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_touch

The key parts of the logs are the EXECVE record and the cwd= and uid/gid fields. I double-checked the uid, gid, and cwd against the previous log. Inode 16879113 turned out to belong to the Linux loader: /usr/lib64/ld-2.17.so. This is because every time you execute an ELF binary, the loader has to read the ELF and load its segments (data, etc.) into memory (I will write another blog post on this, maybe later). So even though we could not pinpoint the exact process that executed /usr/sbin/prelink, we knew the CWD of the process and its UID/GID. In our case, it turned out to be a non-privileged UID used for running apps.
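
If you are wondering how to map an inode number back to a path, find can do it (slow but effective; this assumes the inode lives on the filesystem mounted under /usr):

find /usr -xdev -inum 16879113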

So in the end we could prove that it was not an OS bug; rather, the app was executing prelink and in turn filling /tmp up.

Yep, all because of auditd!!!


misc-fixed-width-6×13 font with dotted zero

So I have been experimenting with fixed-width bitmap fonts for a while, in my quest for the "perfect" bitmap font for programming and terminal apps.

What better bitmap font can you have than misc-fixed-width-6×13, distributed with X11? It's undeniably crisp and easy on the eyes, and really awesome for programming.

If you spend more than 10 hours coding or working in a terminal, you need something that is easy on the eyes yet precise. Bitmap fonts are known to be precise and crisp. I have used ProFont, Terminus, Proggy, etc., but really fell in love with the fixed-6×13 font.

The only thing left on my wish-list was a "dotted zero". So I began experimenting with FontForge. Generating a .dfont for Mac was easy, but somehow Windows 10 did not like the .fon file generated by FontForge.

So I created a .fon file from the original 6x13.bdf using FontForge, and then regenerated the .fon file using Simon Tatham's scripts, available here: http://www.chiark.greenend.org.uk/~sgtatham/fonts/

The end result is really cool.

The modified fonts are available in my GitHub repo: https://github.com/soumyadipdm/Fixed6x13-dotted-zero


Organize your movies per IMDB genre

I have a decent collection of movies that I watch when time permits. Sometimes you want to watch action movies, sometimes horror, at times the animated ones; it all depends on what mood dictates.

This was OK, but the problem was that, being a lazy person, I never bothered to actually organize the movies by genre, and of course I never intended to do that manually.

So I wrote a little Python code that did what I wanted: it organized my movie collection by the genres obtained from IMDB. All thanks to the imdbpy module, which can search the IMDB database for movies and show you a whole lot of attributes of the movie in question. Exactly what I needed.

All that was left for me was to use this module: list the crappy directory where I had dumped all my movies indifferently, get the list of movies, look each one up to get its genres, create a directory named after the genre, and move the movie into that directory (sketched below). Voila!!
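
The idea, sketched in shell (the real script does this in Python with imdbpy; imdb-genre here is a hypothetical helper that prints the first IMDB genre for a title):

cd /data/movies    # the directory where the movies were dumped
for movie in */; do
    genre=$(imdb-genre "${movie%/}")   # hypothetical genre lookup
    mkdir -p "$genre"
    mv -- "$movie" "$genre/"
done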

So the end result is a pretty neat-looking movie library:


├── Adventure
│   ├── Divergent\ (2014)
│   │   ├── Divergent.mp4
│   ├── Jaws\ (1975)
│   │   ├── Jaws.mp4
│   ├── The\ Hobbit\ The\ Battle\ of\ the\ Five\ Armies\ (2014)
│   │   ├── The.Hobbit.The.Battle.of.the.Five.Armies.2014.avi
│   ├── The\ Hobbit\ The\ Desolation\ of\ Smaug\ (2013)\ [1080p]
│   │   ├── The.Hobbit.The.Desolation.of.Smaug.2013.1080p.BluRay.mp4
│   ├── The\ Sign\ of\ Four\ (1987)
│   │   ├── The.Sign.of.Four.1987.720p.BluRay.mp4
│   └── Zombieland\ (2009)
│   ├── Zombieland.2009.720p.BrRip.mp4
├── Animation
│   ├── Big\ Hero\ 6\ (2014)
│   │   ├── Big.Hero.6.2014.1080p.BluRay.mp4
│   ├── Cloudy\ with\ a\ Chance\ of\ Meatballs\ (2009)
│   │   ├── Cloudy.with.a.Chance.of.Meatballs.720p.BluRay.mp4
│   ├── Cloudy\ with\ a\ Chance\ of\ Meatballs\ 2\ (2013)
│   │   ├── Cloudy.with.a.Chance.of.Meatballs.2.2013.720p.BluRay.mp4
│   ├── Home\ (2015)
│   │   ├── Home.2015.720p.BluRay.mp4

I have made the script available in my GitHub repo: https://github.com/soumyadipdm/organize-movies

Happy movie watching!!


Bootstrapping CFEngine with Cobbler Snippets and Systemd

Cobbler's snippets are a module-based approach to kickstart scripts. Without any doubt, they are really handy for customizing kickstart scripts without duplicating effort. Recently I wanted to get my hands dirty with the new RHEL 7 features, so I fired up my home lab VMs with my old Cobbler snippet, which was written for CentOS 6.x.

In CentOS 6.x, you can use /etc/rc.d/rc.local to execute any script/command after all the services are up. But in RHEL/CentOS 7.x, the upstart init is replaced by systemd (very similar to Solaris SMF). There's still /etc/rc.d/rc.local, but there's no guarantee that it will run last, because systemd starts services in parallel.

The problem is that although I could bootstrap CFEngine in the %post section of the kickstart script, that does not guarantee the actual environment CFEngine expects after a complete OS boot. So while I still could have bootstrapped CFEngine as part of the kickstart installation, I wanted to do it as part of the "first boot" (the first time the OS boots after being installed).

So how do I do that? It turns out it's really easy to write a unit file (it helps systemd determine dependencies among services) and make it run at the end of the default target (a target is the systemd equivalent of a SysV run level, essentially a bunch of services to start or stop).

Here's my Cobbler snippet (in my GitHub repo) that does the following:

  1. Detects the profile, i.e. whether it's CentOS 6.x or CentOS 7.x
  2. If it's CentOS 6.x, makes use of /etc/rc.d/rc.local to run the CFEngine bootstrap script
  3. If it's CentOS 7.x, creates the /root/bootstrap_cfe bootstrap script and then writes a temporary systemd service that destroys itself after the first run

A few things to note:

"After=default.target" ensures that the service runs only after everything else in the default target has completed.

"WantedBy=default.target" ensures that the service is actually pulled in when the default target is executed.

"Type=forking": I don't yet know the reason, but for a shell script, forking is the only type that executed it for me; the other types did not do anything.

# CFEngine Boot Strap preparation
# detect profile
# if it's CentOS 6.*, we can use /etc/rc.d/rc.local
# if it's CentOS 7.*, we have to use systemd unit file

#set $prf = $getVar('$profile', None)
#if 'CentOS6' in $prf
#raw
mv /etc/rc.d/rc.local /root/rc.local.bak

cat >>/etc/rc.d/rc.local <<'EOF'
#!/bin/bash
if [[ -f /etc/centos-release ]]; then
    release=$( egrep -o "release [67]" /etc/centos-release )
else
    release="None"
fi
if [[ "${release}" != 'release 6' ]]; then
    exit 1
fi
curl -o /tmp/cfengine.rpm "http://192.168.56.201/cfengine/cfengine-community-3.6.5-1.x86_64.rpm"
rm -f /etc/yum.repos.d/*.repo
yum -q -y --disablerepo='*' install /tmp/cfengine.rpm
if [[ $? == 0 ]]; then
    touch /root/cfe-installed
fi
if [[ -f /root/cfe-installed ]]; then
    /var/cfengine/bin/cf-agent --bootstrap 192.168.56.201
fi
sleep 2
rm -f /tmp/cfengine.rpm
mv /root/rc.local.bak /etc/rc.d/rc.local
chmod 744 /etc/rc.d/rc.local
EOF
chmod 744 /etc/rc.d/rc.local
#end raw
# end of CentOS6.* CFEngine bootstrap
#else if 'CentOS7' in $prf
# create the bootstrap script
#raw
cat >>/root/bootstrap_cfe <<'EOF'
#!/bin/bash
if [[ -f /etc/centos-release ]]; then
    release=$( egrep -o "release [67]" /etc/centos-release )
else
    release="None"
fi
if [[ "${release}" != 'release 7' ]]; then
    exit 1
fi
curl -o /tmp/cfengine.rpm "http://192.168.56.201/cfengine-el7/cfengine-community-3.6.5-1.el7.x86_64.rpm"
rm -f /etc/yum.repos.d/*.repo
yum -q -y --disablerepo='*' install /tmp/cfengine.rpm
if [[ $? == 0 ]]; then
    touch /root/cfe-installed
fi
if [[ -f /root/cfe-installed ]]; then
    /var/cfengine/bin/cf-agent --bootstrap 192.168.56.201
fi
sleep 2
rm -f /tmp/cfengine.rpm
systemctl disable systemd-cfengine-bootstrap.service
rm -f /usr/lib/systemd/system/systemd-cfengine-bootstrap.service
rm -f /etc/systemd/system/multi-user.target.wants/systemd-cfengine-bootstrap.service
EOF
chmod 744 /root/bootstrap_cfe
# end of bootstrap script file
# now create systemd unit file
cat >/usr/lib/systemd/system/systemd-cfengine-bootstrap.service <<'EOF'
[Unit]
Description=CFEngine Bootstrap
After=default.target

[Service]
Type=forking
ExecStart=/root/bootstrap_cfe

[Install]
WantedBy=default.target
EOF
# now enable the service for next boot
systemctl enable systemd-cfengine-bootstrap.service
#end raw

#else
# No profile matched for CFEngine bootstrap
#end if
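
After the first boot, a quick way to check that the bootstrap took (assuming CFEngine's default paths):

cat /var/cfengine/policy_server.dat    # should print 192.168.56.201
ls /usr/lib/systemd/system/systemd-cfengine-bootstrap.service    # should be gone after the self-destruct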


External Node Classification using OpenLDAP – A Scalable solution for CFEngine, Puppet, Chef

Problem: You have a huge infrastructure with thousands of machines to manage. You already have a configuration management system (or are looking for one now), and you are weighing which scalable tools you can use to specify the role of each node, or of a bunch of nodes.

Solution: Puppet has Hiera for this; whether it scales is debatable, and people who are using it might be able to comment. But if you need something rock solid as a hierarchical database platform, as well as massively scalable, I would advise you to take a look at OpenLDAP.

Yes, you heard me right. OpenLDAP has come a long way since the Berkeley DB days. Now you have the LMDB/MDB (Lightning Memory-Mapped DB) backend, which is built for high-volume read queries. That is the primary need of an External Node Classification (ENC) tool, because the config management system will periodically query the ENC for roles. On top of that, you can index specific attributes, like cn, member, memberOf, etc.

Next, synchronization/replication. You do not want to write your own YAML-based hierarchical DB that lacks a sync/replication strategy. Moreover, you do not want to wait 30 minutes or so for a role/ENC DB update to take effect; you need it to happen almost immediately. OpenLDAP provides syncrepl and delta-syncrepl, which are full and incremental replication respectively, and both are event-based. Plus, you can tune the overlay configuration (OLC) to focus on the exact attributes of the objects you care about; in our case, those attributes are "member" and "memberOf".

Above all, scalability. I don't think anybody doubts the scalability of LDAP as a protocol, and OpenLDAP's architecture of Consumers (clients as well as slaves) and Providers (essentially masters) can be leveraged to scale it further.

And if you care, you do not have to run OpenLDAP on dedicated hardware. You can dockerize it (thanks, RHEL 7), virtualize it traditionally, or run it alongside the config management system. OpenLDAP itself is tiny, as long as you have enough disk space for the MDB database (it takes only 270 MB to hold 60k hosts plus ~900 groups; surprised? Don't be!!)

You also do not have to write code to maintain the integrity of your ENC DB; OpenLDAP already provides this in the name of "referential integrity". With it, whenever you add a host to a group, the host's memberOf attribute gets updated automatically (see the sketch below). Remember, the memberOf attribute saves a hell of a lot of time when doing reverse lookups, i.e. "what groups is this host a part of?"
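
For instance (a sketch: the admin DN and suffix are placeholders, and it assumes the memberof/refint overlays are enabled):

ldapmodify -x -D "cn=admin,dc=local,dc=net" -W <<'EOF'
dn: cn=generic_host,ou=groups,dc=local,dc=net
changetype: modify
add: member
member: cn=app01.local.net,ou=hosts,dc=local,dc=net
EOF
# app01's memberOf now lists cn=generic_host,ou=groups,dc=local,dc=net automatically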

Now it’s time for demo.

What I needed:

1. OpenLDAP 2.4.40 (custom compiled with mdb and referential integrity)
2. Python 2.6/2.7 (haven't tested in Python 3, but it should be easy to port)
3. python-ldap and python-argparse modules
4. A cup of hot coffee

I am using CFEngine as the config manager, but the same can be tested with Chef/Puppet and others, if need be.

Schema ldifs can be found in my github repository: https://github.com/soumyadipdm/enc-openldap

Here’s how I use it with CFEngine: https://github.com/soumyadipdm/cfengine_stuffs

I am using two OUs: hosts (for the hosts) and groups (for the roles).

You may follow the steps mentioned in the git repo to set up OpenLDAP for the demo.

Here's how it's set up:

1. CFEngine's promises.cf evaluates the extract_raise_class shell script (execution can be delayed using a splay class; if your infrastructure does not see frequent role changes, once every 15-20 minutes is probably fine)
2. extract_raise_class in turn executes query-enc.py, and the results are written to /etc/roles_classes
3. raise_class is called by CFEngine, and this script just cats the /etc/roles_classes file (this avoids running the costly query-enc.py on every pass)

How query-enc.py queries OpenLDAP:

In the case of CFEngine, role discovery happens on the CFEngine client. Hence, query-enc.py does a reverse lookup to pull out all the roles/groups that the host is a part of.
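
That reverse lookup boils down to something like this (a sketch; the base DN, server name, and attribute layout are assumptions):

ldapsearch -x -H ldap://ldap01.local.net -b "ou=hosts,dc=local,dc=net" \
    "(cn=$(hostname -f))" memberOf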

{unixuser@app01}[enc-openldap|23:59]-> ./query-enc.py hosts:generic_host
app02.local.net
app03.local.net
app04.local.net
app01.local.net
ubuntu.local.net
app05.local.net
app06.local.net

{unixuser@app01}[enc-openldap|00:01]-> ./query-enc.py reverse:`hostname -f`
generic_host
cfe_mps

{unixuser@app01}[enc-openldap|00:03]-> cat /etc/roles_classes
+generic_host
+cfe_mps


Heavenly – A color scheme for putty, created by me

For folks who use PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/) to connect to Linux/Unix boxes and code a lot: good news, you can have this awesome color scheme, created by me, to make your PuTTY experience super fun.

I have seen so many color schemes for iTerm and others, but I could not find much for PuTTY. Either people have created schemes of their own and did not bother to share them, or nobody cares.

Anyway, as I code a lot, I needed something that's not mundane but full of colors. So here I am, having created my own color scheme to revolutionize my PuTTY experience.

The monospaced font in the screenshots is Source Code Pro from Adobe (https://github.com/adobe-fonts/source-code-pro). I am using the Medium weight at size 10.

It really looks cool!! Enjoy🙂

The ".reg" file can be obtained from my GitHub repo: https://github.com/soumyadipdm/Heavenly-putty-color-scheme

Screenshots:

[Screenshot: Heavenly]

[Screenshot: Heavenly_2]

NOTE: I have made a few modifications to PuTTY's defaults:

1. Scrollback lines – 5000

2. No scrollbar

3. No bell


CFEngine: Here I come to get my hands dirty

So I have been learning CFEngine for quite a while now. Being a Puppet guy, it was scary at first: I thought I would get lost in "bundle", "body", "promise", "repair" and so on.

But, as always, the best way to learn a new technology (by the way, CFEngine is the oldest of all the config management tools) is to set up a home lab and try as many dirty things as you can. The more I learn CFEngine, the more respect I feel for it. It's awesome and, believe it or not, it's really, really simple to write a CFEngine policy: no harder than writing a Puppet manifest.

So I wanted to share the policies I wrote for my home lab, including an interesting one: defining a cluster and pushing configs per the cluster definition.

Check out my GitHub repo: https://github.com/soumyadipdm/cfengine_stuffs

Many more to come as I explore further.

Cheers!!


Designing a load-balancer based on OS load

This is my first post in two years. I was in an accident recently and had some spare time to think while resting for a month.

"Do we really consider OS load in load-balancers?" I asked myself one day while drowsy. There are a myriad of load-balancers that balance network traffic at various layers of TCP/IP. Some are TCP-based, some IP-based, and some can balance at the application level, like HTTP. But most of them revolve around network latency and the like, and almost none checks OS load when scheduling traffic to a particular node.

"Why?" Probably because it is less cumbersome to keep track of the metrics that determine or influence the load-balancer's scheduler. But what's the harm in trying to build a tool that can really balance traffic based on OS load metrics like memory usage, CPU usage, I/O wait time, etc.? "No harm at all," I replied to my own question (yes, I am talking to myself these days, after the head/eye injury from the accident). So let's build a tool that can efficiently give a "score card" to every node in the cluster and schedule traffic based on who has the better score. As simple as that!

As a result, I started a project on GitHub, named "Themis".

1. Every node in the cluster runs an agent at a frequency that depends on the overall load of the cluster. The agent deducts points from an initial score of 100 based on available resources: CPU wait time, load average, available RAM, swappiness, etc. These are tunable via predefined thresholds, e.g. 20% of the total CPU time is acceptable as a normal wait time for a particular node. The better the score, the lower the load on that node. There is also a plan for heuristics, to make the scoring smarter over time by adjusting the thresholds automatically. (A toy version of the scorer is sketched after this list.)
2. After the agent assigns the node a score, it sends that score (along with trend data, though probably less frequently) to the load-balancer via QPID (v0.30) messaging, in JSON format.
3. The load-balancer keeps a queue of all the nodes, reverse-sorted by score: the node with the highest score sits at the front of the queue and is scheduled for the next connection; if that node does not respond, the next node in the queue is considered.
4. The load-balancer also keeps track of a "participation score", i.e. how many times a particular node has been selected. The lowest scorers are reported along with the resource-limit/load data pinpointing why they were not selected so often, thus giving the sysadmins a hint about the bottlenecks.
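
A toy version of the scoring idea in shell (the project itself is Python; the weights and the MemAvailable-based memory metric here are made-up placeholders):

#!/bin/bash
# Start from 100 and deduct points for per-core load and memory pressure.
cores=$(nproc)
read load _ < /proc/loadavg
mem_total=$(awk '/^MemTotal/ {print $2}' /proc/meminfo)
mem_avail=$(awk '/^MemAvailable/ {print $2}' /proc/meminfo)

awk -v l="$load" -v c="$cores" -v mt="$mem_total" -v ma="$mem_avail" 'BEGIN {
    s = 100
    s -= 40 * (l / c)          # busier CPUs cost up to ~40 points
    s -= 30 * (1 - ma / mt)    # scarcer memory costs up to 30 points
    if (s < 0) s = 0
    printf "node score: %.0f\n", s
}'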
I have not yet started working on the load-balancer side, nor have I thought of a way to actually switch traffic to the nodes; not yet, anyway. But there are so many projects going on in that area that I can probably get inspired by one of them, or simply use my tool as an add-on to "influence" their scheduler policies.

So folks, if you are a Python devops/sysops person and have some ideas to share, or want to participate in the project, please drop a comment here and we can catch up on GitHub.

Let's code it!!!