Saturday, February 12, 2011

Linux syscall, vsyscall, and vDSO... Oh My!

The following article is a brief investigation I wrote up, which looks into the system call functionality of the Linux kernel. Mainly, I was trying to clarify the difference between vsyscall and vDSO. This was conducted, more or less, on the 2.6.32 and 2.6.37 kernel branches, if I recall properly. Anyway, here goes:


-= What are System Calls =-

System calls are routines through which a program asks the operating system for a specific piece of information, or makes a specific request that only the OS can fulfill. For example, gettimeofday() is how a user-land application commonly obtains the current time from the operating system's kernel. In Linux this call is implemented as a system call. This interface allows a lesser-privileged application to access higher-privileged kernel information [1, 2].

Often these calls are not even accessed directly by an application; rather, they are executed via a wrapper. The GNU C library, glibc, has wrapper functions that do all the handy-dandy wrapping. For instance, a call to gettimeofday() in an application is really just a call to glibc's wrapper for the gettimeofday() system call. The wrapper adds very little overhead: at link-time, gettimeofday() is simply associated with the system's appropriate version of the routine. If one desires to make a true system call, avoiding the wrapper, glibc provides a syscall() routine that allows this. Both the wrapper and the explicit syscall versions of gettimeofday() are demonstrated below:


#include <stdlib.h>
#include <sys/syscall.h>  /* SYS_gettimeofday */
#include <sys/time.h>     /* gettimeofday(), struct timeval */
#include <unistd.h>       /* syscall() */

int main(void)
{
    struct timeval tv;

    /* glibc wrapped */
    gettimeofday(&tv, NULL);

    /* Explicit syscall */
    syscall(SYS_gettimeofday, &tv, NULL);

    return 0;
}


System calls are set up by the kernel at boot-time, and each one is accessed by a number. In the example above SYS_gettimeofday is just a constant integer. This integer is the index into an array of system calls (the system call table). This is not the interrupt descriptor table; that table is what routes the "int 0x80" interrupt itself to the kernel's system call entry point. The system call table can be thought of as a vector of pointers to the actual routines [3]. So at index SYS_gettimeofday is a pointer to the piece of code where the actual sys_gettimeofday() routine exists.

-= Overhead =-

Originally, invoking a syscall in Linux was actually a pretty expensive process, as it was implemented as a software interrupt (int 0x80). When the syscall interface was called, via the "int 0x80" instruction, the CPU would pass control to the OS. The OS would look at the value that was passed in, such as SYS_gettimeofday, use it to index the system call table, and call the matching routine, sys_gettimeofday(). Interrupts force the CPU to save the state of execution as it was just before the interrupt occurred. This state is restored after the interrupt (in this case a syscall) completes.

Ok, so an interrupt by itself is relatively cheap in execution time. However, the way the kernel is designed does add some overhead when a syscall is executed. Linux divides memory into two primary segments: user-space and kernel-space (userland and kernelland). User-space memory is where user applications run. Kernel-space is where all the kernel services run. This segmentation acts as a security barrier, so that crafty and/or malicious user apps cannot directly access the kernel's memory. Making a syscall, as mentioned previously, is just a bridge, a mere wormhole, between these two segments of memory. It is this hop between address spaces that causes considerable overhead, as the kernel must switch from the user to the kernel-land memory segment, and then back again as the syscall completes [5, 6]. Changing the memory addressing requires some register shuffling, which imparts even more overhead on the whole syscall process.

-= vsyscall and vDSO =-

To reduce the overhead of hopping between user and kernel spaces, a newer mechanism in Linux allows certain syscalls to be accessed directly from userspace, without the need to cross the user/kernel space barrier. This is just what the vsyscall and vDSO (virtual Dynamic Shared Object) interfaces do. At boot-time a page of memory is dedicated to containing a subset of syscalls, deemed safe to execute from userland, that should not cause a security hole for the kernel. The page of memory where these calls lie is mapped into each running process' user-space. Thus, when one of these syscalls is made, no context switch between the memory regions of user and kernel-space takes place, and so there is less overhead.

Another interesting means of reducing syscall overhead is specific to the underlying CPU architecture. More recent AMD and Intel chips both implement fast syscall functionality. Instead of issuing an interrupt, programs can issue instructions (SYSCALL/SYSENTER to enter the kernel, SYSRET/SYSEXIT to return) that act faster than a traditional interrupt. Which of these gets used is decided by object code that the OS provides. Therefore, a programmer does not have to consider how to implement/request the use of a SYSCALL/SYSENTER over a traditional "int 0x80" [1]. When applications are built, based on the architecture and the system, vsyscall and vDSO linkage is done automagically.

As demonstrated above, using the syscall() routine conducts a traditional syscall, even if there is vDSO support. Despite this, that call might still be using the newer SYSCALL/SYSENTER CPU instructions. The glibc-wrapped gettimeofday() call is what most programs would use. Since the kernel has been designed to use the most efficient syscall mechanism, that version has the potential to be a virtualized syscall that is mapped into userspace.

To determine whether a specific call is using a virtualized (user-space) syscall or a traditional, memory segment-shuffling syscall, the strace utility can be used. A call served from the vsyscall or vDSO page never enters the kernel, so strace does not report it. If a true traditional syscall is being conducted, strace will output the routine in a form similar to the following:

gettimeofday({1297472587, 581519}, NULL) = 0


According to comments in glibc-2.12.1: "The vsyscall page is a virtual DSO (Dynamic Shared Object) pre-mapped by the kernel" [7].

vsyscall and vDSO are similar in how they work; however, there are some slight differences. vsyscall is limited to 4 entries, and is static in memory. Therefore, any statically linked application can rely on where the vsyscalls are loaded. On the other hand, vDSO is dynamically loaded into the user process, so its location is not predictable due to Linux's address space layout randomization. If more than 4 vsyscalls are needed, then a vDSO should be used instead [8].
For example, run `cat /proc/self/maps` and look at both the '[vdso]' and '[vsyscall]' entries. If your system supports these, the memory range for vDSO is different for each process started, and vsyscall is totally predictable. If you don't believe me, run 'cat /proc/self/maps' a few times and note the addresses of vdso and vsyscall. Since this example is looking at '/proc/self/maps', the memory mappings displayed are for that 'cat' process [9, 10].

[1] http://en.wikipedia.org/wiki/System_call
[2] Linux User's Manual: intro(2)
[3] http://www.tldp.org/LDP/khg/HyperNews/get/syscall/syscall86.html
[4] http://en.wikipedia.org/wiki/Interrupt
[5] http://en.wikipedia.org/wiki/Kernel_(computing)
[6] http://www.linux.it/~rubini/docs/ksys/ksys.html
[7] glibc-2.12.1
[8] Linux Kernel 2.6.37 arch/x86/kernel/vsyscall_64.c
[9] http://anomit.com/2010/04/18/examining-the-linux-vdso/
[10] http://www.trilithium.com/johan/2005/08/linux-gate/

-Matt

14 comments:

  1. There's a bunch of stuff you've mentioned that I've been wondering about for quite some time. I knew about the int 0x80 stuff (hey, I'm old school from when that's all we had) but I didn't know about the rest. I am going to have to come back to this page and read it several times!

  2. Thanks Mitch. In fact, I used to do all my asm with good ole' int 80. Even if your CPU supports the newer syscall/sysenter instruction, you can still use int 80h.

  3. Hello,

    I came across this great article while trying to understand something. Maybe you would know about it?

    I have observed that the time it takes to call gettimeofday has decreased *significantly* (~500 ns to ~20 ns) when I moved from a 2.6.18 system to a 2.6.33 system (similar h/w).

    2.6.18 uses the vsyscall method while 2.6.33 uses vdso method to make this invocation fast. But, I do not think that this difference can account for such a huge difference in the times.

    Do you have any insight into what has caused this improvement?

    Many thanks!

    Replies
    1. Hi MK,
      Without performing some calculations I cannot say if the speedup you are witnessing is from the switch of vsyscall to vdso. Either way, both vdso and vsyscall avoid the user/kernel context switch. So, in your case, I am a bit interested as to where this speedup is occurring. If possible, I'd compile a basic C program calling gettimeofday(), one for each kernel, and step-through both.

  4. Hi Davis,

    Can you please tell me why in some of the kernels there is one page exactly above the beginning of vsyscall mapped and accessible. But neither the gdb nor the /proc//maps really reports it.

    #include <stdio.h>

    int main() {
    char *p = (char *) (0xffffffffff600000-2);
    printf("%d\n", *p);
    return 0;
    }

    Running the above simple test case SIGSEGVs on only a few kernels, one of which is 2.6.18-92.el5 (rhel 5.8); on most kernels, including my desktop (3.2.0-29-generic), the address is accessible. The weird part is that even if you step through instruction by instruction in gdb, it reports that the memory is not accessible, yet it actually is accessible and the program outputs either -1 or 0!!

    Replies
    1. Hi. I do not know the answer to your question, off of the top of my head. While it is not reported, it still lies within the userspace of a process. Perhaps it is for threads and holds the room for thread stacks?

    2. Hey Davis,
      Thanks for the quick response. There are two parts to the above problem 1) The maps/gdb does not show it 2) Only certain kernels use this space. Looking at kernel source has not helped yet, and the space for threadstack looks unlikely because it's just one page, anything above it is unmapped, and below is vsyscall. So is this some book keeping space for the vsyscall itself? But would be nice to know precisely who uses it and why it's not documented anywhere(including maps:)).

      Thanks.

    3. Interesting. I really don't have much to say. Are you perhaps using an alternate libc (e.g. not glibc?). An alternative libc with the combination of a modified/patched kernel might use that unmapped space for some auxiliary purposes. Please keep me posted. Also, feel free to email me directly.

    4. You can probably find the answer here: http://lwn.net/Articles/446528/

      It mentions that the variables in the vsyscall page were moved to a non-executable page, somewhere around kernel 3.1.

  5. Actually it's the default glibc.

  6. Hi Matt,
    Very nice article - Thanks. I was wondering if you can shed some light on a problem that I am facing in examining the vdso. I am trying to read the proc/pid/mem for a child process - I can successfully read most of it, but not the vdso part. Reading the vdso part from /proc/*/mem gives me Input/Output error. Any idea why this might be happening or thoughts on getting it to work?

    Replies
    1. It might be best that we take this offline. Shoot me an email and we can work on it together. Upon initial consideration, the vdso is read only, so you should be able to read it. You do not change the permissions of the child process do you (presumably not since you can read other portions of its mem)?

      However, the vdso should be the same for all processes. So, you can read it from any context and make use of the data as you wish. Now, remember that it is not writeable, which kinda prevents you from manipulating it.

      In the meantime, check out the following article. I have some code in there, which I borrowed from a resource listed. I believe I did modify it a tad from that in [10] above.
      http://www.linuxjournal.com/content/creating-vdso-colonels-other-chicken

    2. Thanks for the quick response! No, I am not changing the permissions of the child process. I checked out your other post as well, but I am still clueless as to what is going wrong.

      Btw, where can I find your email? (It's not there on your blog profile page)

  7. Click on my Google plus profile icon. And when you get to the "About" page by my name should have an envelope/email icon. Use that. If it does not work I'll post it here.
