Who’s killing that process? Who’s dumping prelink files in /tmp? — Linux auditd to the rescue

So on the other day at work we had some interesting issues.

  1. People started complaining that config management tool was killing application processes.
  2. Few days before that we had people complaining about /tmp getting filled up due to mysterious prelink files, and blame was on RHEL 6. Somebody said it was due to a bug (https://bugzilla.redhat.com/show_bug.cgi?id=584550). Hmm!! Somebody knows how to get on to StackExchange ūüôā

Anyways, obviously in both the cases the app owners did not make sense. Because if it was problem with config management tool, or bug in the OS we would see global issues, not on few servers.

So how do you tackle/RCA such issues?

Well, Linux already has something awesome for us. Linux audit subsystem can help a lot in dissecting issue like the above ones.

For issue 1, I enabled audit on `kill` system call, filtering it further with SIGKILL and SIGTERM. You can do this by issuing the following commands:

auditctl -a always,exit -F arch=b64 -F a1=15 -S kill -k log_kill
auditctl -a always,exit -F arch=b64 -F a1=9 -S kill -k log_kill

auditctl(8) man page has a very good run down on the arguments. Here’s a short description of the options of the command above:

-a to append rules after the existing ones. “always” stands for “log it as audit even always”. “exit” stands for “event should be logged under exit list after exit of system call being audited”

-F is filter. Multiple -F are combined with an implicit AND. So here I am looking for 64bit system calls, a0 through a3 are 4 arguments to the system call. Here I am only interested in SIGTERM and SIGKILL signal. Looking at kill(2) man page, I figured that second argument takes signal name/number, so in this case that would be a1=15. kill(2) says the first argument is the PID, so if I really wanted to narrow down, I could even specified -F a0=12646 where 12646 is the PID of the process being audited.

-S takes system call name or number, you can have multiple of them if you want to audit more than one system call. /usr/include/asm/unistd_64.h defines¬†all 320 system calls. You may want to refer that if do not know the system call name or number already. You have to install “kernel-headers” package for your kernel version to get access to the header file.

-k takes a word which can be referred as key. It makes searching logs super convenient, you can even delete the rule using the same key.

Awesome, now that audit rule is set up, all we had to do was to wait for the next incident when application process gets killed.

Here’s a demo of what I did after¬†the app got killed. To demonstrate, I am using a script which sleeps 5 minutes and then wake up only to sleep for another 5 minutes. I pushed it to the background and issued a kill -9 on the PID

Capture

Now let’s see what the audit log has to say about the event. Audit logs are written in /var/log/audit/audit.log in raw format. I used ausearch(8) to filter result, plus it gives really human friendly output.

So I used below command to check the logs.

ausearch -i -k log_kill

-i interprets uid, gid, syscalls, arguments etc. into human readable words by reading /etc/password etc. It really helps, without this option, you get numeric/hexadecimal equivalents.

-k filter the search with the key we created before

And it showed me this:

Capture

—-
type=OBJ_PID msg=audit(01/23/2016 21:09:50.413:174) : opid=12646 oauid=unixguy ouid=unixguy oses=3 ocomm=script_kill_me.
type=SYSCALL msg=audit(01/23/2016 21:09:50.413:174) : arch=x86_64 syscall=kill success=yes exit=0 a0=0x3166 a1=SIGKILL a2=0x0 a3=0x3166 items=0 ppid=1338 pid=1339 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=bash exe=/usr/bin/bash key=log_kill

The log is pretty clear (you just have to concentrate on bold parts of it) on which process sent SIGKILL (exe=) to the process being audited (opid=)

And boy, did we know who was killing the app process!! It was certainly not the config management tool. ūüôā

Now it’s time for issue #2, using the same auditd we can monitor files, directories. The mysterious prelink files are actually generated using /usr/sbin/prelink program. As per prelink(8) man page:

prelink is a program that modifies ELF shared libraries and ELF dynami‚Äź
cally linked binaries in such a way that the time needed for the
dynamic linker to perform relocations at startup significantly
decreases. Due to fewer relocations, the run-time memory consumption
decreases as well (especially the number of unshareable pages). The
prelinking information is only used at startup time if none of the
dependent libraries have changed since prelinking; otherwise programs
are relocated normally.

So someone/some program/script was calling prelink to create prelink files. If we just enabled audit on /tmp/*prelink* files, it would not have given us much. It would only show us that at some point /usr/sbin/prelink was called to create/write/change attribute of the prelink files. But I also want to know who executed /usr/sbin/prelink at the same time. That would give us solid proof on who’s that devil that filling /tmp with junk.

So first let’s enable audit on the mysterious file:

auditctl -w /tmp/i_am_a_mysterious_file -p wa -k mysterious_file

“-p wa” says that I am only interested in “write” and “change of attribute” events of the mysterious file.

And then enable audit on the /usr/sbin/prelink, for demo I am using /usr/bin/touch

auditctl -w /usr/bin/touch -p x -k mysterious_touch

“-p x” here says that I am only interested in execution event of the file in question.

So after a while I execute below to see when the mysterious file was written into or attributes were changed.

ausearch -i -k mysterious_file

—-
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=2 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=1 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.577:418) : item=0 name=/tmp/ inode=25165953 dev=fd:00 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 objtype=PARENT
type=CWD msg=audit(01/23/2016 23:18:06.577:418) : cwd=/home/unixguy
type=SYSCALL msg=audit(01/23/2016 23:18:06.577:418) : arch=x86_64 syscall=open success=yes exit=3 a0=0x7ffff6b47404 a1=O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK a2=0666 a3=0x7ffff6b45c80 items=3 ppid=16652 pid=16655 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_file
—-
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=2 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=1 name=(null) inode=26006543 dev=fd:00 mode=file,664 ouid=unixguy ogid=unixguy rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.265:420) : item=0 name=/tmp/ inode=25165953 dev=fd:00 mode=dir,sticky,777 ouid=root ogid=root rdev=00:00 objtype=PARENT
type=CWD msg=audit(01/23/2016 23:18:20.265:420) : cwd=/home/unixguy
type=SYSCALL msg=audit(01/23/2016 23:18:20.265:420) : arch=x86_64 syscall=open success=yes exit=3 a0=0x7fffb6fa2404 a1=O_WRONLY|O_CREAT|O_NOCTTY|O_NONBLOCK a2=0666 a3=0x7fffb6fa08a0 items=3 ppid=16652 pid=16658 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_file

From the logs, I take note of the start and end time of the operation, as well as uid of the process.

Then I issue below, filtering the timestamp I got from above:

[unixguy@app01 ~]$ sudo ausearch -i -k mysterious_touch -ts 23:18:06 -te 23:18:21
—-
type=PATH msg=audit(01/23/2016 23:18:06.576:417) : item=1 name=(null) inode=16879113 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:06.576:417) : item=0 name=/usr/bin/touch inode=8587749 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=CWD msg=audit(01/23/2016 23:18:06.576:417) : cwd=/home/unixguy
type=EXECVE msg=audit(01/23/2016 23:18:06.576:417) : argc=2 a0=/usr/bin/touch a1=/tmp/i_am_a_mysterious_file
type=SYSCALL msg=audit(01/23/2016 23:18:06.576:417) : arch=x86_64 syscall=execve success=yes exit=0 a0=0x1a92fd0 a1=0x1a96a10 a2=0x19852a0 a3=0x7fffb7234d50 items=2 ppid=16652 pid=16655 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_touch
—-
type=PATH msg=audit(01/23/2016 23:18:20.264:419) : item=1 name=(null) inode=16879113 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=PATH msg=audit(01/23/2016 23:18:20.264:419) : item=0 name=/usr/bin/touch inode=8587749 dev=fd:00 mode=file,755 ouid=root ogid=root rdev=00:00 objtype=NORMAL
type=CWD msg=audit(01/23/2016 23:18:20.264:419) : cwd=/home/unixguy
type=EXECVE msg=audit(01/23/2016 23:18:20.264:419) : argc=2 a0=/usr/bin/touch a1=/tmp/i_am_a_mysterious_file
type=SYSCALL msg=audit(01/23/2016 23:18:20.264:419) : arch=x86_64 syscall=execve success=yes exit=0 a0=0x1a96a10 a1=0x1a9c340 a2=0x19852a0 a3=0x7fffb7234d50 items=2 ppid=16652 pid=16658 auid=unixguy uid=unixguy gid=unixguy euid=unixguy suid=unixguy fsuid=unixguy egid=unixguy sgid=unixguy fsgid=unixguy tty=pts0 ses=3 comm=touch exe=/usr/bin/touch key=mysterious_touch

The key parts of the logs are in bold face. I double confirmed uid, gid, cwd with the previous log. Inode 16879113 turned out to be of linux loader: /usr/lib64/ld-2.17.so. This is because, every time you execute an ELF binary, the loader has to read ELF and load the segments like data etc. into memory (I will write up another blog on this one, maybe later). So even though we could not pinpoint what exact process was executing /usr/sbin/prelink, we know the PWD of the process and UID/GID. In our case, it turned out to be a non-privileged UID used for running apps.

So at the end we could prove that it was not OS bug, rather the app was executing prelink and in turn filling /tmp up.

Yep, all because of auditd!!!

Advertisements

About admin_xor

Un*x/Linux junkie, loves to program, automate, automate and automate
This entry was posted in Uncategorized. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s