Whenever I saw this, I would scratch my head, poke around a bit, and not really get anywhere with it. Today, I had the time to isolate one of these gremlins and see what was going on.
Strangely, the child process was unresponsive to strace or pstack, so I went back to the parent process and gave it a kill -3 to get a thread dump.
Since Jenkins uses such descriptive thread names, I was able to find the Git polling log for the job and saw that it had attempted to launch git 6 days ago and was still stuck. Even better, the timestamp on that git polling log and the timestamp on my Mysterious Jenkins Fork were exactly the same!
So, why was this git process being launched as a new Jenkins process? I found the answer when I googled for "UnixProcess.forkAndExec" on a lark and came across this wiki page as the first result. As per that page:
In Unix, the way Java launches a new process is first by forking the current process and then exec-ing. Therefore, for a brief period between fork and exec, there's temporarily two instances of your application server running on your system. If the application server you are running Hudson in is using 1GB virtual memory, this means you temporarily need 2GB of virtual memory
I did not know that! The wiki page goes on to talk about needing more swap space as a consequence of this. But we aren't getting that particular error message, so it doesn't seem likely that memory is our problem. Nevertheless, there are other resources that a process needs, such as file descriptors.
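For reference, here's roughly what that looks like from the Java side: launching an external command goes through ProcessBuilder (or Runtime.exec), and on Unix the JVM implements the launch by forking itself and then exec-ing the target binary. This is only a sketch -- the particular git command shown is an illustrative assumption, not necessarily what the git plugin actually runs:

```java
import java.io.IOException;

public class GitPollSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Launching any external command from Java goes through ProcessBuilder
        // (or Runtime.exec). On Unix, start() is implemented by forking the
        // current process and then exec-ing the target binary -- the
        // UnixProcess.forkAndExec call that turned up in the search.
        Process git = new ProcessBuilder("git", "ls-remote", "origin")
                .redirectErrorStream(true)   // merge stderr into stdout
                .start();

        // The parent (Jenkins, in our case) then waits for the child to finish.
        int exitCode = git.waitFor();
        System.out.println("git exited with " + exitCode);
    }
}
```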
How many file descriptors does a Jenkins instance need? Quite a few, based on what I've seen from running lsof -p against our Jenkins process.
Interestingly, Fedora defaults to a maximum of 1024 open file descriptors (ulimit -n) for processes of a given user. So if a single Jenkins instance forks to exec git, it could temporarily go over this limit. I'm not sure why the process responds to hitting this limit by consuming 100% of CPU instead of throwing an error -- that seems like a bug in either Linux or the Sun JVM.
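As a quick sanity check, you can count descriptors without lsof by poking at /proc. This is a Linux-specific sketch and only an approximation of what lsof -p reports; the limits file is the same number ulimit -n prints:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class FdCount {
    public static void main(String[] args) throws IOException {
        // Count the descriptors this JVM currently holds by listing /proc/self/fd
        // (Linux-specific); roughly what `lsof -p <pid>` would show.
        File[] fds = new File("/proc/self/fd").listFiles();
        System.out.println("open file descriptors: "
                + (fds == null ? "unknown" : fds.length));

        // Print the per-process open file limit -- the number `ulimit -n` reports.
        BufferedReader limits = new BufferedReader(new FileReader("/proc/self/limits"));
        try {
            String line;
            while ((line = limits.readLine()) != null) {
                if (line.startsWith("Max open files")) {
                    System.out.println(line);
                }
            }
        } finally {
            limits.close();
        }
    }
}
```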
However, the file descriptor limit theory would help explain why we rarely see this problem -- we're right at the maximum in the normal case, and may only hit it when multiple git SCM pollings are running for a given Jenkins instance. We're going to increase the maximum open files a bit (the nofile setting in /etc/security/limits.conf) and see if that alleviates the issue.
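For reference, that change is just a couple of lines in limits.conf. The user name and the value below are placeholders, not the exact settings we ended up with:

```
# /etc/security/limits.conf
# Raise the open-file limit for the account Jenkins runs as.
# "jenkins" and 4096 are illustrative values, not our actual settings.
jenkins    soft    nofile    4096
jenkins    hard    nofile    4096
```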
[Update] Martin Ellis points out that Linux uses copy-on-write when forking, so you shouldn't need double the memory available. He also points to the fork man page, which may suggest that file handles are shared between the two processes:
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two descriptors share open file status flags, current file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
This might suggest that our guess (that exceeding the open file limit causes the CPU spin) is malarkey. I'm not sure how to verify it one way or the other, though.