Fixing nginx-rtmp exec directives in Docker

I recently built a project using the nginx-rtmp module. Once everything worked as intended, I started packaging it into a Docker image - and suddenly all exec_* directives in the nginx-rtmp config stopped working. Let me invite you to a long debugging session...

The Situation

I need nginx to execute scripts when the RTMP stream starts and stops, so I made my nginx config look something like this:

nginx.conf
...

rtmp {
    access_log /dev/stdout;

    server {
        listen 1935;

        application main {
            live on;
            record off;

            exec_publish /app/publish.sh;
            exec_publish_done /app/publish_done.sh;
        }
    }
}

This runs the script /app/publish.sh as soon as a client starts publishing to the stream, and /app/publish_done.sh once the stream ends. It worked perfectly fine when I ran it locally, so I thought putting it into a Docker image to deploy to a server would be a matter of minutes.
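
The contents of the two scripts don't matter for this story; a minimal, purely illustrative stand-in for /app/publish.sh would be something like:

#!/bin/sh
# hypothetical placeholder - nginx-rtmp simply forks and runs this file
# whenever a client starts publishing to the application
echo "stream started at $(date)" >> /app/events.log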

I was so wrong!

The Problem

I copied the exact same config into a container and tried to run it. The RTMP server itself worked fine, but to my surprise it would not run the two scripts. Soo...

Logs?

What do you do when something doesn't work as you wish? You check the logs.

*5 exec: starting unmanaged child '/app/publish.sh', client: 172.20.0.1, server: 0.0.0.0:1935

Hmm, nginx said it started the process, so maybe...

Permissions?

A classic. Surely the permissions of the source files were not preserved when the Docker image was built. That must be the issue. I'll just set the executable flag on the scripts and it will work!

Except it didn't. The executable flag was already set. I tried executing the script manually inside the container, and that surprisingly worked.
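
In case you want to check the same things yourself, that boils down to the following two commands (the container name rtmp is just a placeholder for whatever your container is called):

docker exec -it rtmp ls -l /app/publish.sh   # the x bits were already set
docker exec -it rtmp /app/publish.sh         # running it by hand worked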

Script fails?

Maybe my script itself was failing, so I replaced its contents with a simple

echo foooooo > baaaar

The baaaar file didn't get created.

Wrong PATH?

I changed the exec_publish command to /bin/sleep 1000, double-checked that /bin/sleep really exists in the container, and tried again, hoping to find a sleep process in the process list - but no. Only nginx processes.

Huh??

At this point I almost gave up. This is one of those issues where your browser tabs stop displaying their titles because there are too many of them, and you feel like the only person on the planet pulling their hair out over this specific issue. I tried starting and stopping the stream a few more times in the hope that I'd notice something - and suddenly, I did notice something weird!

My PC fans started ramping up. Pretty unusual when all that's running is a text editor, a terminal and an idle nginx, huh? So I looked at the process list again using top and saw multiple nginx processes continuously sitting at 6-7% total CPU load. On a 16-thread CPU this is a red flag, because 6-7% of the total is exactly one thread at 100%.

That meant nginx was probably stuck in some seemingly endless loop. Up until now I had been running the Alpine Linux image of nginx-rtmp. Maybe that was the problem?

Same but different???

To get closer to the issue, I compiled the nginx-rtmp module from source, both locally and inside an Arch Linux Docker container, since Arch is what I also run locally. Then I tried again - and again it worked locally but not inside the container. I even copied the locally built nginx binary and its libraries into the container, which did not help.
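
For reference, building nginx with the module is the usual out-of-tree module procedure - roughly the following, with the nginx version and paths being examples you'd adjust to your setup:

git clone https://github.com/arut/nginx-rtmp-module.git
curl -LO https://nginx.org/download/nginx-1.24.0.tar.gz
tar xzf nginx-1.24.0.tar.gz
cd nginx-1.24.0
./configure --add-module=../nginx-rtmp-module
make
make install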

Now it was clear that something in the code that runs the exec_* directives must behave differently inside Docker containers.

Tracing the issue

I traced all syscalls the nginx process inside the container made using

strace -f -o /app/output.log nginx

and once I started the stream and nginx tried to execute the script, strace printed out a lot of failing close syscalls. Hundreds of thousands of them, with incrementing file descriptor IDs:

...
close(352763)       = -1 EBADF (Bad file descriptor)
close(352764)       = -1 EBADF (Bad file descriptor)
close(352765)       = -1 EBADF (Bad file descriptor)
close(352766)       = -1 EBADF (Bad file descriptor)
close(352767)       = -1 EBADF (Bad file descriptor)
close(352768)       = -1 EBADF (Bad file descriptor)
close(352769)       = -1 EBADF (Bad file descriptor)
close(352770)       = -1 EBADF (Bad file descriptor)
close(352771)       = -1 EBADF (Bad file descriptor)
close(352772)       = -1 EBADF (Bad file descriptor)
...

I searched for the file responsible for running the exec_* directives, which is called ngx_rtmp_exec_module.c. The only place where close was called in a loop was at line 781. There it loops over all numbers from 0 to the sysconf value _SC_OPEN_MAX, which is specified as "the maximum number of files that a process can have open at any time". Of course, the cleaner way would be to keep track of all file descriptors that actually get opened and then close only those. I don't know the nginx C API well enough, though, and can imagine that it's not feasible here, so the author of nginx-rtmp opted for the brute-force method.
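
Stripped of all nginx specifics, the forked child essentially does something like this before exec'ing the script (a simplified sketch, not the exact nginx-rtmp source):

#include <unistd.h>

/* Simplified sketch: close every possible file descriptor,
 * from 0 up to whatever sysconf reports as the per-process
 * open-file limit, before exec'ing the configured script. */
static void close_all_fds(void)
{
    long fd, maxfd;

    maxfd = sysconf(_SC_OPEN_MAX);

    for (fd = 0; fd < maxfd; fd++) {
        close(fd);   /* almost every call fails with EBADF, as seen in strace */
    }
}

With a sane limit this loop is over instantly; with a huge limit the child spins here for a very long time - which matches the one-core-at-100% load from before.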

And indeed: while this value was 1024 on my local system, it was 1073741816 in the Docker container. Yes, that is nginx trying to close more than one billion (!) file descriptors before it even starts the specified script.
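
You can see the same discrepancy without strace by simply asking the shell - on Linux, sysconf(_SC_OPEN_MAX) reports the soft open-file limit, which is exactly what ulimit -n shows. The image name and the exact numbers below depend on your setup and Docker daemon configuration:

ulimit -n                                  # on my host: 1024
docker run --rm alpine sh -c 'ulimit -n'   # inside a container: 1073741816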

The Solution

Fortunately, the maximum number of files a process may open is very easy to change at runtime. A simple

ulimit -n 1024

before starting nginx is enough to fix the issue.
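
In a Docker setup there are a couple of equivalent ways to apply this; both of the following are sketches with example values rather than the exact files from my project. Either drop the limit in the entrypoint:

#!/bin/sh
# entrypoint.sh - lower the fd limit, then hand over to nginx in the foreground
ulimit -n 1024
exec nginx -g 'daemon off;'

or let Docker set the limit for the container directly (my-rtmp-image being a placeholder):

docker run --ulimit nofile=1024:1024 my-rtmp-image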

I haven't yet found an answer to why Docker sets the limit to such an insanely high value.
