I built and prepared a project using the nginx-rtmp module. Once everything worked as intended, I started packing it into a Docker image - and suddenly all exec_* directives in the nginx-rtmp config stopped working. Let me invite you to a long debugging session...
The Situation
I needed nginx to execute scripts when the RTMP stream starts and stops, so I made my nginx config look something like this:
...
rtmp {
    access_log /dev/stdout;

    server {
        listen 1935;

        application main {
            live on;
            record off;

            exec_publish /app/publish.sh;
            exec_publish_done /app/publish_done.sh;
        }
    }
}
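The scripts themselves don't matter for this story - any executable will do. A hypothetical stand-in for /app/publish.sh could be as simple as:

#!/bin/sh
# hypothetical placeholder - the real script's contents are irrelevant here
echo "stream started at $(date)" >> /app/events.log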
With that, nginx executes /app/publish.sh once data is pushed into the stream, and /app/publish_done.sh once the stream ends. This worked perfectly fine when I ran it locally, so I thought putting it into a Docker image to deploy to a server would be a matter of minutes.
I was so wrong!
The Problem
So I copied the exact same config into a container and tried to run it. The RTMP server itself worked fine, but to my surprise it would not run the two scripts. Soo...
Logs?
What do you do when something doesn't work as you wish? You check the logs.
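Since everything runs inside the container, the quickest place to look is the container's output (the container name here is just an example):

docker logs -f rtmp-test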
*5 exec: starting unmanaged child '/app/publish.sh', client: 172.20.0.1, server: 0.0.0.0:1935
Hmm, nginx said it started the process, so maybe...
Permissions?
A classic. Surely the permissions of the source files weren't preserved when building the Docker image. That must be the issue. I'll just set the executable flag on them and it's going to work!
Except it wasn't. The executable flag was already set. I tried executing the script manually in the container, and that surprisingly worked.
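Both the check and the manual run are just a docker exec away (again, the container name is only an example):

docker exec -it rtmp-test ls -l /app/publish.sh
docker exec -it rtmp-test /app/publish.sh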
Script fails?
Maybe my script just fails to start the commands, so I replaced it with a simple
echo foooooo > baaaar
The baaaar file didn't get created.
Wrong PATH?
I tried to change the exec_publish command to /bin/sleep 1000, double-checked that /bin/sleep really exists and tried again, hoping to find a sleep process in the process list, but no. Only nginx processes.
Huh??
At this point I almost gave up. This is one of those issues where your browser stops displaying tab titles because there are too many tabs open, and you seem to be the only person on the planet pulling their hair out over this specific problem. I tried starting and stopping the stream a few more times in the hope that I'd notice something - and suddenly, I did notice something weird!
My PC fans started ramping up. Pretty unusual when running a text editor, a terminal and an idle nginx, huh? So I looked at the process list again using top and saw multiple nginx processes running continuously at 6-7% total CPU load. On a 16-thread CPU this is a red flag, because 6-7% is exactly one thread at 100%.
That meant nginx was probably stuck in an endless loop for some reason. Up until now, I had been running the Alpine Linux image of nginx-rtmp. Maybe that was the problem?
Same but different???
To get closer to the issue, I compiled the nginx-rtmp module from source, both locally and inside an Arch Linux Docker container, since Arch is what I run locally. Then I tried again - and again it worked locally but not inside the container. I even copied the compiled nginx binaries and libraries from my host into the container, which did not help.
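For reference, the build itself is just the standard nginx source build with the module added; roughly something like this (version numbers and paths are placeholders):

# fetch nginx and the rtmp module
curl -LO https://nginx.org/download/nginx-1.24.0.tar.gz
tar xzf nginx-1.24.0.tar.gz
git clone https://github.com/arut/nginx-rtmp-module.git

cd nginx-1.24.0
./configure --add-module=../nginx-rtmp-module
make && make install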
Now it was clear that something in the code that runs the exec_* directives must behave differently inside Docker containers.
Tracing the issue
I traced all syscalls the nginx process inside the container was making using
strace -f -o /app/output.log nginx
and once I started the stream and nginx tried to execute the script, strace printed out a lot of failing close syscalls. Hundreds of thousands of them, with incrementing file descriptor IDs:
...
close(352763) = -1 EBADF (Bad file descriptor)
close(352764) = -1 EBADF (Bad file descriptor)
close(352765) = -1 EBADF (Bad file descriptor)
close(352766) = -1 EBADF (Bad file descriptor)
close(352767) = -1 EBADF (Bad file descriptor)
close(352768) = -1 EBADF (Bad file descriptor)
close(352769) = -1 EBADF (Bad file descriptor)
close(352770) = -1 EBADF (Bad file descriptor)
close(352771) = -1 EBADF (Bad file descriptor)
close(352772) = -1 EBADF (Bad file descriptor)
...
I searched for the source file responsible for running the exec_* directives, which is ngx_rtmp_exec_module.c. The only place where close was called in a loop was on line 781. There it loops through all numbers from 0 to the sysconf value _SC_OPEN_MAX, which is specified as "The maximum number of files that a process can have open at any time".
Of course, the cleaner way would be to keep track of all file descriptors that get opened and then close only those. I don't know the nginx C API though, and I can imagine that it's not feasible to do that here, so the author of nginx-rtmp opted for the brute-force method.
And indeed, while this value was set to 1024 on my local system, it was set to 1073741816 in the Docker container. Yes, that is nginx trying to close more than 1 billion (!) file descriptors before starting the specified script.
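The difference is easy to see from a shell, since ulimit -n prints the current soft limit for open file descriptors (the value inside the container depends on your Docker daemon's defaults):

# on the host
ulimit -n                                  # 1024 on my machine

# inside a container
docker run --rm alpine sh -c 'ulimit -n'   # 1073741816 in my setup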
The Solution
Fortunately, it's very easy to change the number of files a process can open at runtime. A simple
ulimit -n 1024
before starting nginx is enough to fix the issue.
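In a container, one way to do that is a tiny entrypoint wrapper (a sketch - the nginx flags are whatever your image already uses):

#!/bin/sh
# entrypoint.sh - lower the open-file limit, then hand control over to nginx
ulimit -n 1024
exec nginx -g 'daemon off;'

Alternatively, Docker can set the limit for you when starting the container, e.g. with docker run --ulimit nofile=1024:1024, so the image itself stays untouched.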
I haven't yet found an answer to why Docker sets the limit to such an insanely high value.