nginx discovery journey: Day 31 - ngx_str

That's quite a title, isn't it? I wonder how Google's indexer is going to like this. Anyway, that's not the point. The point is that I wanted to tell you about this yesterday but got distracted by agentzh's hints telling me to review my development chain...

As already mentioned here (or at least in my translation of Valery's work: Development of modules for nginx), nginx uses its own version of strings. It can be traced back to the Pascal-style string, because a string is basically a "chunk" of data and a length. So, in C, it gives something like:

typedef struct {
    size_t      len;
    u_char     *data;
} ngx_str_t;

On the other side of the ring, the current champion: the good old C-style '\0'-terminated chunk of data:

char *str;

Now, why would Igor go through so much pain to reinvent the concept of a string? I personally see two possible reasons:

Save memory
Save cpu cycles

God save the memory. I see eyebrows rising: is he out of his mind? A pointer plus a size_t is always going to be bigger than just a pointer. Yes, but one bull is heavier than two frogs, isn't it? Let me spell it out for you... Let's say you are parsing a file (it works with data coming out of a socket, too). If you are a bit into optimization, you actually loaded the file in memory with mmap or something similar. Now, if you are using C-style strings, each time you want to store a configuration directive (the header and its value) you must make a copy of the original data and add a '\0' at the end of it. At this point, you end up with a process that is storing two versions of the same string in memory: one from the file and one with the trailing '\0'. With ngx_str_t, on the other hand, you can point to the same memory area that you used to read the file and "limit" the size with the length field of your structure. It all comes down to how you are planning to use it. And when you are building a web server, being able to reduce the number of copies of the processed data is definitely a good idea.
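Here is a minimal sketch of that idea, not nginx's actual parser. The ngx_str_t layout is re-declared locally so the snippet compiles without the nginx headers, and split_directive is a hypothetical helper invented for the illustration. It splits a "name value" chunk at the first space and makes both strings point into the original buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the layout shown above, so this compiles without nginx. */
typedef unsigned char u_char;

typedef struct {
    size_t      len;
    u_char     *data;
} ngx_str_t;

/*
 * Hypothetical helper (not an nginx function): split the n bytes at buf
 * at the first space. name and value both point INTO buf; nothing is
 * copied and no '\0' is written. Returns 0 on success, -1 if no space.
 */
static int
split_directive(u_char *buf, size_t n, ngx_str_t *name, ngx_str_t *value)
{
    u_char  *sp;

    assert(buf != NULL && name != NULL && value != NULL);

    sp = memchr(buf, ' ', n);
    if (sp == NULL) {
        return -1;
    }

    name->data = buf;
    name->len = (size_t) (sp - buf);

    value->data = sp + 1;
    value->len = n - name->len - 1;

    return 0;
}
```

The catch is lifetime: both strings are only valid as long as the buffer they point into stays alive and unmodified.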

God save the cpu cycles. I don't know about you, but I can tell you that I have seen a lot of C riddled with strlen calls that figure out the size of the same string at least 2 or 3 times without anyone noticing. And the buffer overflow exploits have made it worse. Once people realise they are using non-safe versions of the functions, they switch to the safe version: I'll grep all my strcmp and replace them with strncmp. Don't get me wrong, it's the right thing to do, but most of the time you end up with a bunch of extra calls to strlen as a result... You could argue that this is true of beginners and we all know Igor is no beginner. But he knows that one of the principles of the HTTP protocol is basically to "say" how long the data is going to be before sending it. As there is a lot of length manipulation, I think having the length handy all the time when working with a string is definitely a good idea.
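To make the point concrete, here is a small sketch of an equality test that never calls strlen. Again, the struct is re-declared locally and str_eq is a hypothetical name of mine, not one of the nginx string functions. Because both lengths are already known, a mismatch in length is rejected in constant time, before a single byte is compared.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local copy of the layout shown above, so this compiles without nginx. */
typedef unsigned char u_char;

typedef struct {
    size_t      len;
    u_char     *data;
} ngx_str_t;

/* Hypothetical helper: returns 1 if a and b hold the same bytes, else 0. */
static int
str_eq(const ngx_str_t *a, const ngx_str_t *b)
{
    assert(a != NULL && b != NULL);

    if (a->len != b->len) {
        return 0;               /* different lengths: no scan needed */
    }

    if (a->len == 0) {
        return 1;               /* two empty strings; data may be NULL */
    }

    return memcmp(a->data, b->data, a->len) == 0;
}
```

Note that the first argument can be a slice of a larger buffer (say, the first 4 bytes of "Host: example.org") and the comparison still works, since nothing relies on a terminator.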

On the other hand, of course, whenever you have to interact with "traditional" C code, you have to convert the string. But this is a small price to pay. And this is also the reason why most of the string manipulation functions you know and love have their ngx_* counterpart. This way you don't have to convert before comparing two strings.
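For the cases where a conversion really is needed, it boils down to one allocation and one copy. The sketch below uses plain malloc to stay self-contained, and str_to_cstr is a hypothetical helper; inside a real nginx module you would normally take the memory from an nginx pool instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Local copy of the layout shown above, so this compiles without nginx. */
typedef unsigned char u_char;

typedef struct {
    size_t      len;
    u_char     *data;
} ngx_str_t;

/*
 * Hypothetical helper: build a '\0'-terminated copy of s for code that
 * expects a classic C string. The caller must free() the result.
 * Returns NULL if the allocation fails.
 */
static char *
str_to_cstr(const ngx_str_t *s)
{
    char  *out;

    assert(s != NULL);

    out = malloc(s->len + 1);
    if (out == NULL) {
        return NULL;
    }

    if (s->len > 0) {
        memcpy(out, s->data, s->len);
    }
    out[s->len] = '\0';

    return out;
}
```

When the "traditional" code is just printf, even that copy can be skipped: the standard "%.*s" format takes the length as an int argument, as in printf("%.*s", (int) s.len, s.data).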

So, ngx_str_t looks like a winner (at least for the work at hand). But...

One Ring to bring them all and in the ~~darkness~~ light bind them. Where did you expect a geek to find his references when talking to other geeks? ;) My personal favorite solution would have been to avoid deciding and bring together the best of the two worlds with an object that could present both a C-style interface and a Pascal-like one. I haven't looked recently at the implementation of the string object in C++, but almost 15 years ago, implementations already had reference counting, lazy length caching and copy-on-write. But, for some reason (don't ask me why, I have no clue), nginx is written in C, although a lot of the time, with the callbacks and handlers and modules, it feels object-like.

Tuesday, March 1, 2011

Day 31 - ngx_str_t vs. char * requiem
