Code Tidbits: Wrk pt.1

| Comments

I have this plan to do some write up series related to programming and such, mostly as another form of practicing C systems programming. Just when I wondered about what topic I should write, I found this gem written by Will Glozer. So I think I can study the source code and write some little explanation here and there about the code, hence the tidbits. Feel free to critic and correct things if you find mistakes on this. It would be my honour.

How things work from here

I’ll make the writings into several parts, each explaining a module or two around the project. I will start from the nuts and bolts first, going to the main program/module at the last part.

Intro to wrk

wrk, as described by its author, is a modern http benchmarking tool. It’s written in C, event-based, and some part of it is actually stolen from redis sorce base. It’s fairly easy to use, and the stats is comprehensive, you can grab the source from here. Just type make to compile, and you’ll be ready to benchmark your website.

I kinda like the source code for its simplicity and learn a lot from them. The library used from redis including zmalloc and ae. I will going into zmalloc, before that, lets see this aprintf first.

aprintf

You know that in C all strings are null-terminated. The usual usage pattern is to allocate some memory, either by malloc or statically using array of char, and then copy the data while make sure you have enough memories allocated, or else, you may have buffer overflow in your program.

So this is where aprintf come into stage. The idea is simple, allocate enough data just-in-time. By letting the user to make sure all of their string is null-terminated, aprintf can calculate the space needed and reallocate enough memory for it. Here’s the source looks like:

(aprintf.c) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
// Copyright (C) 2012 - Will Glozer.  All rights reserved.

#include <stdarg.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

char *aprintf(char **s, const char *fmt, ...) {
    char *c = NULL;
    int n, len;
    va_list ap;

    va_start(ap, fmt);
    n = vsnprintf(NULL, 0, fmt, ap) + 1;
    va_end(ap);

    len = *s ? strlen(*s) : 0;

    if ((*s = realloc(*s, (len + n) * sizeof(char)))) {
        c = *s + len;
        va_start(ap, fmt);
        vsnprintf(c, n, fmt, ap);
        va_end(ap);
    }

    return c;
}

aprintf work like sprintf. The trick here is to call vsnprintf with NULL as string buffer and specify 0 as its length, as seen at line 14. By C99 standard, vsnprintf wouldn’t print more than the length specified but will still return the actual size of the string. For example vsnprintf(NULL, 0, "%d", 100); will return 3, but no output since we specify 0 as its output length.

Unlike sprintf, aprintf accepts pointer of pointer as its first parameter, since it will reallocate the memory to fit the new string. That means it doesn’t matter whether we passed either null pointer or already filled string. The function will always work as expected. The caveat here is you’re supposed to pass pointer of pointer of char and not pointer of array of char.

zmalloc

This is a thin layer for libc malloc, or another malloc implementation such as jemalloc or tcmalloc for that matter, and we may specify which underlying malloc to use at compile time. zmalloc itself designed to add size tracking capability for either each allocation and its total size. By prefixing each allocation with few bytes (defined as macro called PREFIX_SIZE) and use this prefix to save the requested allocation size.

Here’s a snippet from zmalloc.h

define HAVE_MALLOC_SIZElink
1
2
3
4
5
6
7
8
9
10
11
12
13
/* Double expansion needed for stringification of macro values. */
#define __xstr(s) __str(s)
#define __str(s) #s

#if defined(USE_TCMALLOC)
#define ZMALLOC_LIB ("tcmalloc-" __xstr(TC_VERSION_MAJOR) "." __xstr(TC_VERSION_MINOR))
#include <google/tcmalloc.h>
#if TC_VERSION_MAJOR >= 1 && TC_VERSION_MINOR >= 6
#define HAVE_MALLOC_SIZE 1
#define zmalloc_size(p) tc_malloc_size(p)
#else
#error "Newer version of tcmalloc required"
#endif

Above is an example how tcmalloc set to be used as an underlying malloc implementation. I want to highlight HAVE_MALLOC_SIZE macro above (line 9, 8-13), which set as a switch to tell the compiler that this version of malloc has its own allocation-size implementation and we prefer to use that (by expanding zmalloc_size to tc_malloc_size).

This bring us to where PREFIX_SIZE defined:

define PREFIX_SIZElink
1
2
3
4
5
6
7
8
9
#ifdef HAVE_MALLOC_SIZE
#define PREFIX_SIZE (0)
#else
#if defined(__sun) || defined(__sparc) || defined(__sparc__)
#define PREFIX_SIZE (sizeof(long long))
#else
#define PREFIX_SIZE (sizeof(size_t))
#endif
#endif

PREFIX_SIZE will expand to 0 if HAVE_MALLOC_SIZE is set, means that zmalloc doesn’t have to allocate extra memory to track the allocation size. Other than that, we will have to allocate extra memory as large as the size of size_t, which corresponds to the compiler’s processor architecture (i.e. 4 bytes for 32bit machine), except if you’re using Sun sparc, in which the size of long long is used.

Before we see how zmalloc implemented, let’s take a look at how zmalloc track the memory:

malloc statisticslink
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#define update_zmalloc_stat_alloc(__n,__size) do { \
    size_t _n = (__n); \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    if (zmalloc_thread_safe) { \
        pthread_mutex_lock(&used_memory_mutex);  \
        used_memory += _n; \
        pthread_mutex_unlock(&used_memory_mutex); \
    } else { \
        used_memory += _n; \
    } \
} while(0)

#define update_zmalloc_stat_free(__n) do { \
    size_t _n = (__n); \
    if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1)); \
    if (zmalloc_thread_safe) { \
        pthread_mutex_lock(&used_memory_mutex);  \
        used_memory -= _n; \
        pthread_mutex_unlock(&used_memory_mutex); \
    } else { \
        used_memory -= _n; \
    } \
} while(0)

static size_t used_memory = 0;
static int zmalloc_thread_safe = 0;
pthread_mutex_t used_memory_mutex = PTHREAD_MUTEX_INITIALIZER;

The first macro above is used to update total memory used by zmalloc, accumulated as a static variable called used_memory. It has optional thread-safe capability, making sure there’s no race condition happened in case several threads using zmalloc concurrently, by using simple mutex. To activate thread-safe, simply call zmalloc_enable_thread_safeness function, which looks like this:

zmalloc_enable_thread_safenesslink
1
2
3
void zmalloc_enable_thread_safeness(void) {
    zmalloc_thread_safe = 1;
}

There is curious line from the macros above,

hmmm
1
if (_n&(sizeof(long)-1)) _n += sizeof(long)-(_n&(sizeof(long)-1));

which actually has nothing to do with the allocation itself, but always make sure the size (_n) is aligned to multiple of sizeof(long) (i.e. 4 bytes for 32bit machine). This is, to assume that all malloc implementation snap their allocation size fit with multiple of sizeof(long). So the return value of allocation size is consistent.

Now to the zmalloc definition, it’s really straight forward:

zmalloclink
1
2
3
4
5
6
7
8
9
10
11
12
13
void *zmalloc(size_t size) {
    void *ptr = malloc(size+PREFIX_SIZE);

    if (!ptr) zmalloc_oom(size);
#ifdef HAVE_MALLOC_SIZE
    update_zmalloc_stat_alloc(zmalloc_size(ptr),size);
    return ptr;
#else
    *((size_t*)ptr) = size;
    update_zmalloc_stat_alloc(size+PREFIX_SIZE,size);
    return (char*)ptr+PREFIX_SIZE;
#endif
}

In fact, it only do three things, first is the actual allocation, a call to malloc with PREFIX_SIZE addition (line 2). Second, optionally save the size at the first few bytes of allocated memory (line 9), we can see that the ptr is casted to pointer of size_t so it will take at least the first 4 bytes to save the size. The third is to update total memory used, by using update_zmalloc_stat_alloc macro.

At return, the pointer location get shifted by PREFIX_SIZE, since the first few bytes is already used and user don’t have to know about it. zcalloc implementation is almost verbatim with this, except the call to malloc get replaced by calloc.

For realloc, it has to make sure that the size value at prefix, and the total size value, is synchronized with the actual memory allocated after realloc. To do this, it simply preserve the old size, reallocate memory, save the new size at prefix area, and then removing the old size from the total used memory. Here’s how it looks like:

zrealloclink
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
void *zrealloc(void *ptr, size_t size) {
#ifndef HAVE_MALLOC_SIZE
    void *realptr;
#endif
    size_t oldsize;
    void *newptr;

    if (ptr == NULL) return zmalloc(size);
#ifdef HAVE_MALLOC_SIZE
    oldsize = zmalloc_size(ptr);
    newptr = realloc(ptr,size);
    if (!newptr) zmalloc_oom(size);

    update_zmalloc_stat_free(oldsize);
    update_zmalloc_stat_alloc(zmalloc_size(newptr),size);
    return newptr;
#else
    realptr = (char*)ptr-PREFIX_SIZE;
    oldsize = *((size_t*)realptr);
    newptr = realloc(realptr,size+PREFIX_SIZE);
    if (!newptr) zmalloc_oom(size);

    *((size_t*)newptr) = size;
    update_zmalloc_stat_free(oldsize);
    update_zmalloc_stat_alloc(size,size);
    return (char*)newptr+PREFIX_SIZE;
#endif
}

I have a hunch that the expansion for update_zmalloc_stat_free can be removed by passing the delta between the new allocation size and its old size to update_zmalloc_stat_alloc. But this way might be better since the code is easier to follow.

So are you still with me up to this point? Good. I wont blame you if you’re bored out by this. Anyway, we still have some codes left. Here comes zfree:

zfreelink
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
void zfree(void *ptr) {
#ifndef HAVE_MALLOC_SIZE
    void *realptr;
    size_t oldsize;
#endif

    if (ptr == NULL) return;
#ifdef HAVE_MALLOC_SIZE
    update_zmalloc_stat_free(zmalloc_size(ptr));
    free(ptr);
#else
    realptr = (char*)ptr-PREFIX_SIZE;
    oldsize = *((size_t*)realptr);
    update_zmalloc_stat_free(oldsize+PREFIX_SIZE);
    free(realptr);
#endif
}

As we can see, it is intuitive, it simply decreasing the total used memory size, then free the memory. Of course that if our underlying malloc does not have malloc_size, the pointer should be shifted back so we get our originally allocated memory (line 11-14).

zstrdup and zmalloc_used_memory is really simple. The first only memcpy-ing some string to pointer which is previously allocated with zmalloc. The latter simply return total used_memory.

get RSSlink
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
/* Get the RSS information in an OS-specific way.
 *
 * WARNING: the function zmalloc_get_rss() is not designed to be fast
 * and may not be called in the busy loops where Redis tries to release
 * memory expiring or swapping out objects.
 *
 * For this kind of "fast RSS reporting" usages use instead the
 * function RedisEstimateRSS() that is a much faster (and less precise)
 * version of the funciton. */

#if defined(HAVE_PROCFS)
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

size_t zmalloc_get_rss(void) {
    int page = sysconf(_SC_PAGESIZE);
    size_t rss;
    char buf[4096];
    char filename[256];
    int fd, count;
    char *p, *x;

    snprintf(filename,256,"/proc/%d/stat",getpid());
    if ((fd = open(filename,O_RDONLY)) == -1) return 0;
    if (read(fd,buf,4096) <= 0) {
        close(fd);
        return 0;
    }
    close(fd);

    p = buf;
    count = 23; /* RSS is the 24th field in /proc/<pid>/stat */
    while(p && count--) {
        p = strchr(p,' ');
        if (p) p++;
    }
    if (!p) return 0;
    x = strchr(p,' ');
    if (!x) return 0;
    *x = '\0';

    rss = strtoll(p,NULL,10);
    rss *= page;
    return rss;
}
#elif defined(HAVE_TASKINFO)
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/sysctl.h>
#include <mach/task.h>
#include <mach/mach_init.h>

size_t zmalloc_get_rss(void) {
    task_t task = MACH_PORT_NULL;
    struct task_basic_info t_info;
    mach_msg_type_number_t t_info_count = TASK_BASIC_INFO_COUNT;

    if (task_for_pid(current_task(), getpid(), &task) != KERN_SUCCESS)
        return 0;
    task_info(task, TASK_BASIC_INFO, (task_info_t)&t_info, &t_info_count);

    return t_info.resident_size;
}
#else
size_t zmalloc_get_rss(void) {
    /* If we can't get the RSS in an OS-specific way for this system just
     * return the memory usage we estimated in zmalloc()..
     *
     * Fragmentation will appear to be always 1 (no fragmentation)
     * of course... */
    return zmalloc_used_memory();
}
#endif

/* Fragmentation = RSS / allocated-bytes */
float zmalloc_get_fragmentation_ratio(void) {
    return (float)zmalloc_get_rss()/zmalloc_used_memory();
}

Now this is interesting, zmalloc implement OS-specific code to fetch RSS (Resident Set Size) from current process. This is to reflect the actual memory used by program and later, to calculate fragmentation ratio between total bytes allocated by zmalloc and memory used by process.

The first three functions were enclosed by #if. The first condition is checking procfs, used by linux etc. And that #elif condition is checking for TaskInfo which is found in Mac OS.

Linux MMU use page as a unit, a single page usualy contain 4096bytes. You can check your system page size with this command:
getconf PAGESIZE
In the first function, rss information fetched from /proc/%pid%/stat is counted in page, so it’s only natural that we need to multiply this size with the size of the page to get the actual bytes used by the process.

I’m not familiar with MacOSX system but I can tell from the code above that it’s awesomely simple. The task_basic_info struct has this resident_size field, the struct populated by task_info function and then we can just return the resident size since it’s already counted in byte.

The last case is self-explained. It will just return total allocated memory as RSS since there’s no other way (implemented) to fetch this data.

Fragmentation ratio as defined in the last function is a comparison, the value will never be less than 1 since that mean something is wrong with your system.

And that’s all for zmalloc, I will go into units and stats module in the next part, where zmalloc and aprintf is heavily used.

Comments