Detecting which process is creating a file using the LD_PRELOAD trick

The other day I was debugging an issue: I was trying to figure out which process was creating and writing to a file on Linux.

There are multiple ways to detect this (some are better and more efficient than others). In this post I’m going to explain how I did it using the LD_PRELOAD trick.

This approach might not be the best or the most efficient, but it’s a fun one and it can come in handy in many other situations as well.

Alternative solutions

Before diving into how to use LD_PRELOAD to solve this problem, let’s have a look at a couple of other (more efficient) solutions which could also work.

1. Using auditd

auditd is a subsystem for monitoring and accounting on Linux. Other *nix and BSD based systems include similar subsystems (e.g. a standard installation of FreeBSD includes audit).

Among other things, auditd allows you to monitor file accesses and writes and that’s exactly what we are looking for.

To use it, you first create a file watch rule using auditctl:

sudo auditctl -w <file path> -p w

I’ve used the w permission flag because we want to monitor file writes and creates.

And then you can use aureport to view audit reports:

sudo aureport -f

2. Using strace, dtrace or a similar tool for tracking syscalls

Another approach is using a tool like strace or dtrace which allows you to monitor all the system calls used by a running process.

Those tools are very powerful (especially dtrace, which offers a very flexible scripting language), but dtrace is not yet available on all Linux distributions, and strace has to be pointed at a specific process: you either attach it to a process which is already running or use it to launch the program you want to trace.

In my case I was trying to find the offending process, so I didn’t know what to trace and this approach doesn’t really work here.

3. Using inotify

Inotify is a kernel subsystem which allows you to monitor and subscribe to file system changes.

There are two problems with this approach which make it less than ideal for solving this problem:

  1. inotify is used through system calls which means you need to use some other higher-level tool which uses inotify underneath or write some code yourself.
  2. inotify only tells you that a file has been modified, but it doesn’t tell you who modified it.

The first problem can be solved pretty easily. You can use an existing tool such as inotifywait or write a couple of lines of code yourself. The good thing is that you can find inotify bindings for most of the popular higher-level languages (e.g. there is pyinotify for Python) and some frameworks like Node.js already provide support for it in the standard library (see fs.watch). The second problem can’t be worked around, because inotify events carry no information about the process which triggered them.
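To give an idea of what “a couple of lines of code” looks like, here is a minimal sketch in C. The function names (watch_directory, wait_for_file_event) are made up for this example, and it watches the directory which contains the file rather than the file itself, so that creation is reported too. Notice that nothing in the event tells you which process was responsible.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Start watching a directory for file creation and writes.
 * Returns an inotify file descriptor, or -1 on error. */
int watch_directory(const char *dir)
{
    int fd = inotify_init1(IN_CLOEXEC);
    if (fd < 0)
        return -1;
    if (inotify_add_watch(fd, dir, IN_CREATE | IN_MODIFY | IN_CLOSE_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Block until an event arrives for the file called `wanted` inside the
 * watched directory. Returns the event mask, or 0 on a read error.
 * struct inotify_event has no field identifying the process which caused
 * the event. Events left in the buffer after the match are dropped, which
 * is acceptable for a sketch but not for real use. */
uint32_t wait_for_file_event(int fd, const char *wanted)
{
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));

    for (;;) {
        ssize_t len = read(fd, buf, sizeof(buf));
        if (len <= 0)
            return 0;
        for (char *p = buf; p < buf + len; ) {
            const struct inotify_event *ev = (const struct inotify_event *)p;
            if (ev->len > 0 && strcmp(ev->name, wanted) == 0)
                return ev->mask;
            p += sizeof(struct inotify_event) + ev->len;
        }
    }
}
```

You would call watch_directory("/var") once and then wait_for_file_event(fd, "myfile") in a loop, checking the returned mask for IN_CREATE or IN_MODIFY.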

4. Using fuser

My friend Lakshmi asked if the fuser command didn’t help me. The answer is no.

fuser is a useful command line tool which lists all the processes which are currently using a file or a directory (underneath it just uses procfs).

The problem is that it doesn’t work well for my use case. It only lists processes which are currently using an existing file. There are two problems with that:

  1. It doesn’t support polling and it only works on an existing file.
  2. It only lists processes which are currently using a file - this means processes which are currently holding a file or socket open.

Both of those problems can be worked around by writing a simple loop which calls fuser indefinitely, but the problem with that is that it will most likely miss processes which only open a file for a short amount of time (e.g. fast open, write, close sequence).

On top of that, this polling approach is very inefficient and there are better and way more efficient approaches for this, like the aforementioned inotify subsystem.

Similar arguments also apply to the lsof approach.

LD_PRELOAD approach

OK, now back to the LD_PRELOAD approach I have decided to use.

LD_PRELOAD allows you to specify a list of ELF shared libraries to load before other libraries, including libc.

To use it, you simply set the LD_PRELOAD environment variable to point to your shared library or libraries.

For example:

LD_PRELOAD=/home/me/mylibrary.so ./myprogram

This approach is very powerful and you can, among many other things, use it to mask functions provided by libc and other libraries. This comes in very handy in many cases, including this one.

Like the other approaches described above, this one also has some limitations:

  • The LD_PRELOAD approach doesn’t work with binaries which have the suid permission bit set (see setuid and setgid for more info)
  • If you use SELinux, it will, by default, set the AT_SECURE glibc flag on a domain transition (which happens on execve), which makes the dynamic linker treat the new process like a setuid one and ignore LD_PRELOAD and other unsafe environment variables.
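If you are not sure whether a given process ended up in this mode, you can ask glibc for the AT_SECURE value from inside that process. A minimal sketch, assuming a glibc recent enough to provide getauxval; the function name is mine:

```c
#include <sys/auxv.h>

/* Returns 1 when the kernel flagged this process for secure execution
 * (set-user-ID or set-group-ID binary, file capabilities, or a request
 * from a security module such as SELinux), 0 otherwise. */
int in_secure_execution_mode(void)
{
    return getauxval(AT_SECURE) != 0;
}
```

Since a preloaded library is not loaded at all in that mode, a check like this belongs in the target program or in a small diagnostic binary started the same way, not in the preload library itself.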

I’ve used this approach to solve my problem by masking / wrapping the fopen and open functions provided by libc.

The wrapped functions behave almost the same as the original ones; the only difference is that they log file access information to a file before calling the original function.

In my case, I’ve logged the timestamp and the pid and name of the process which has called the function.

Some code which shows how you can do that is shown below.

log_file_access.c

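#define _GNU_SOURCE  // Needed for RTLD_NEXT; must come before any include
#include <stdio.h>
#include <stdarg.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <fcntl.h>
#include <dlfcn.h>

#define FILE_NAME "/var/myfile"

typedef FILE* (*orig_fopen_func_type)(const char *path, const char *mode);
typedef int (*orig_open_func_type)(const char *pathname, int flags, ...);

// Note: programs can also reach the file through open64, openat, creat or
// fopen64, which are not wrapped here.

int log_file_access(const char *path) {
    if (path == NULL || strcmp(path, FILE_NAME) != 0) {
        // Not a file we are interested in
        return -1;
    }

    // Log file access to a file. Need to use flock or a similar locking
    // approach if all the accesses are written to the same file.
    // ...
    return 0;
}

FILE* fopen(const char *path, const char *mode)
{
    log_file_access(path);

    // RTLD_NEXT finds the next "fopen" in the search order (the libc one)
    orig_fopen_func_type orig_func;
    orig_func = (orig_fopen_func_type)dlsym(RTLD_NEXT, "fopen");
    return orig_func(path, mode);
}

int open(const char *pathname, int flags, ...)
{
    mode_t mode = 0;

    // The third argument (mode) is only passed when a file may be created
    if (flags & O_CREAT) {
        va_list args;
        va_start(args, flags);
        mode = (mode_t)va_arg(args, int);
        va_end(args);
    }

    log_file_access(pathname);

    orig_open_func_type orig_func;
    orig_func = (orig_open_func_type)dlsym(RTLD_NEXT, "open");
    return orig_func(pathname, flags, mode);
}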
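The body of log_file_access is left as a comment in the listing above. Here is one way the logging part could look, as a sketch under a few assumptions of mine: the function name append_access_record, the log line format and reading the process name from /proc/self/comm are all choices made for this example. It uses open and write instead of stdio and takes an exclusive flock while appending. Inside the preload library these open calls go through the wrapper too, so the log path must differ from the watched file to avoid recursion.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <time.h>
#include <unistd.h>

/* Read this process's name from /proc/self/comm (Linux-specific). */
static void read_process_name(char *buf, size_t size)
{
    ssize_t n = -1;
    int fd = open("/proc/self/comm", O_RDONLY);
    if (fd >= 0) {
        n = read(fd, buf, size - 1);
        close(fd);
    }
    if (n <= 0) {
        snprintf(buf, size, "unknown");
        return;
    }
    buf[n] = '\0';
    buf[strcspn(buf, "\n")] = '\0';  /* strip the trailing newline */
}

/* Append "<unix time> <pid> <process name> <accessed path>" to log_path.
 * Returns 0 on success, -1 on failure. */
int append_access_record(const char *log_path, const char *accessed_path)
{
    char name[64];
    char line[512];

    read_process_name(name, sizeof(name));
    int len = snprintf(line, sizeof(line), "%ld %d %s %s\n",
                       (long)time(NULL), (int)getpid(), name, accessed_path);
    if (len < 0 || (size_t)len >= sizeof(line))
        return -1;

    int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    int rc = -1;
    /* Serialize writers, since many processes may log at the same time */
    if (flock(fd, LOCK_EX) == 0) {
        if (write(fd, line, (size_t)len) == (ssize_t)len)
            rc = 0;
        flock(fd, LOCK_UN);
    }
    close(fd);
    return rc;
}
```

A real wrapper should probably also save and restore errno around the logging call, so the wrapped program doesn’t see errors caused by the logger.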

Makefile:

log_file_access.so: log_file_access.c
	gcc -shared -fPIC log_file_access.c -o log_file_access.so -ldl

clean:
	rm -f *.so

Using it:

LD_PRELOAD=/home/myuser/mypath/log_file_access.so ./myapp

To make it work for all started processes, I modified the upstart scripts and the /etc/profile file to set the LD_PRELOAD environment variable.

Other use cases for LD_PRELOAD

As noted above, LD_PRELOAD can come in handy in many different scenarios. One of the cases worth mentioning is mocking library functions for tests.

A while back, when I was still at Rackspace, we were discussing how to mock and test the MySQL check used by our monitoring agent. The MySQL check fetches a bunch of metrics from a MySQL server using a MySQL client library (libmysql).

One of the approaches I suggested was to use LD_PRELOAD to wrap functions from the MySQL client library and make them return mock data. I thought this was pretty clever, but soon afterwards Paul came up with an even more clever approach using ffi.

It turned out that the ffi approach was even simpler and better, but this was mostly because, unlike most other language runtimes, LuaJIT includes a really nice ffi library in the core. Okay, cffi for Python is not bad either, but that’s mostly because it’s modeled after the LuaJIT one :-)

If you aren’t using ffi or can’t use it, LD_PRELOAD is a good and valid alternative.
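To show the idea without guessing at the MySQL client API, here is a minimal sketch which mocks libc’s time function instead. The fixed timestamp is an arbitrary value picked for this example. Because the replacement never needs the real function, there is no dlsym call at all.

```c
#include <time.h>

/* Arbitrary fixed timestamp, chosen only for illustration */
#define MOCK_TIME ((time_t)1389398400)

/* Same signature as libc's time(). When this library is preloaded, the
 * dynamic linker resolves calls to time() here instead of in libc, so the
 * program under test always sees the same "current" time. */
time_t time(time_t *tloc)
{
    if (tloc != NULL)
        *tloc = MOCK_TIME;
    return MOCK_TIME;
}
```

You would build it the same way as the library above (gcc -shared -fPIC) and set LD_PRELOAD only for the test run. Mocking a third-party library works the same way, as long as the replacement function has exactly the signature of the one it replaces.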

Remember that this is just the tip of the iceberg. Other cool use cases include wrapping the ptrace function to defeat debugger detection and other anti-debugging techniques in an application, and so on.

Edit 1 (January 11th, 2014) - Added a section about fuser.