Monday, April 25, 2011

Day 49 - the radix tree: ngx_radix_tree_t

It all started with locations and ended as an exploration in the binary search tree algorithms. That's what I like with this trip: you never know where you are going to end up. So, for today I decided to look into the last tree structure provided by nginx (or at least, the last tree structure I could find). There might be others but I have not run into them...so far.

I already gave you hints at what a radix tree is. I told you the tree used for finding locations is a mix between a radix tree and a binary search tree. I even gave you a link to the wikipedia definition. If you haven't read it, you can look at my post about locations, queues and location trees, find the link, click on it and read the definition (I'm not doing all the link copy/pasting for you this time). So, basically, a radix tree is a tree where each node gets you further down a string. Something like:

/foo/
 \_b
 |  \_ar
 |  |  \_*
 |  \_iz
 |     \_*
 |     \_ness
 |        \_*
 \_crap
    \_*

My ascii-art is just getting better and better... ;) I think I'll make a career switch and become the Warhol of ascii-art. ;) In the meantime, I'll still comment on my masterpieces. Here there is not much to explain except that * denote a dead-end (or a string end if you prefer). Now, the example above contains the following strings:

  1. /foo/bar
  2. /foo/biz
  3. /foo/bizness
  4. /foo/crap

Now, let's look at nginx version of a radix tree (or rather at what a node looks like):

typedef struct ngx_radix_node_s  ngx_radix_node_t;
struct ngx_radix_node_s {
    ngx_radix_node_t  *right;
    ngx_radix_node_t  *left;
    ngx_radix_node_t  *parent;
    uintptr_t          value;
};

I can see your face from here: radix trees are not binary trees. Why in hell, do nodes have only a left and a right child. Maybe radix trees are binary trees in disguise. Or fig tree, while we are at it... ;) Well, the real answer is that usually they are not but sometimes they might be. And in the case at hand, they are. So, let's see what is this case.

If you look into the source code you'll see there is only one place where ngx_radix_tree_t is used in nginx code base: the HttpGeoModule. Watch out: don't mistake this module for the HttpGeoIPModule:

  1. HttpGeoIPModule creates (and assigns values to) variables based on the IP address of the request client and one of Maxmind GeoIP databases. One of the common uses is to set the country of the end-user as a nginx variable.
  2. HttpGeoModule lets you map IP ranges to values. For each request, the resulting value is assigned to a variable of your choice. This is a more generic mechanism but it does not come with its own database of countries so if you were to do what HttpGeoIPModule does with this module, you would have to write your own IP-to-country database (a thing most people don't want to). A quick example:
    geo  $srvr_platform  {
      default          prod;
      127.0.0.1/32     dev;
      192.168.1.0/24   int;
      10.1.0.0/16      preprod;
    }
    

Back to our radix tree. Now you probably guessed that nginx uses a radix tree to store the various IP ranges configured with the geo directive. And storing IP related information is one of the common uses of radix trees.

Here, the string is not made of characters but of bits and the IP-address is just a 32-bit address. And a bit can be only 0 or 1 (at least last time I checked). Therefore, for this use you need a max of two branches out of any node in the tree: one for 0 (the left branch) and one for 1 (the right branch). It also explains why there is no key in the ngx_radix_node_t structure: you don't need it.

Although this is named a radix tree, if you take wikipedia's definition, it's a trie because the nodes with only one child are not "merged together" like they should be in a radix tree. As a consequence, this structure "costs" a bit more memory than it should. However, if you look closely at the code you'll see that Igor put quite some effort in being extra careful with memory allocation (actually pre-allocating a big chunk of memory for this tree) and making sure the extra size does not break performance by generating page faults. That's the way Igor is if you had not figured yet... ;)

Now that you know all this, you should see there is something wrong (and that will become more and more wrong as time goes): this assumes IP addresses are 32 bit long. With IPv6 they are not anymore... Still, they keep their hierarchical nature so everything is not to be thrown away but at least the functions ngx_radix32tree_find, ngx_radix32tree_insert and ngx_radix32tree_delete are in for a serious lifting... Sounds like a nice feature improvement for anyone who cares to submit a patch.

Sunday, April 24, 2011

Day 48 - ngx_rbtree_t

I told you so much about red-black trees over the last few days that I cannot pass on the pleasure of looking at the nginx version of those beasts, namely the ngx_rbtree_t structure.

From core/ngx_rbtree.h, you get the following definition of a tree:

struct {
    ngx_rbtree_node_t     *root;
    ngx_rbtree_node_t     *sentinel;
    ngx_rbtree_insert_pt   insert;
};

A red-black tree is a root node, a sentinel node (for those of you who didn't pay attention to the part about ngx_queue_t in my previous post about locations and ngx_queue_t, here is what wikipedia defines as a sentinel node) and a ngx_rbtree_insert_pt, also known by its friends as a void (*ngx_rbtree_insert_pt) (ngx_rbtree_node_t *root, ngx_rbtree_node_t *node, ngx_rbtree_node_t *sentinel), also known as a pointer to a function that will be invoked to insert a node in the tree.

A node is something like that:

struct {
    ngx_rbtree_key_t       key;
    ngx_rbtree_node_t     *left;
    ngx_rbtree_node_t     *right;
    ngx_rbtree_node_t     *parent;
    u_char                 color;
    u_char                 data;
};
  • left and right are the usual names for the children nodes that all nodes of a binary search tree have (yes, a red-black tree is a special kind of binary search tree but you should really read my other posts...).
  • parent is easy to figure out.
  • color can be red or black (I won't get into the details of the red-black algorithm, but roughly you have one node out of two red and the other black.
  • I left the best fields out: key and data because they really depend on what the red-black tree is like. The best way to explain how this works is probably to look at an example. I picked up the DNS resolver. Well, one might wonder why nginx needs its own DNS resolver. This is probably a good topic for exploration in the future. For now, all I know is that sometimes nginx cannot figure out at configuration time what is the backend (or upstream) server it needs to talk to. And when it does, it needs to resolve the name. And it resolves it asynchronously (just like everything it does) and puts the result in a cache (which happens to be a red-black tree).

But be fore we move to the example, I would like to give you a brief list of the interesting functions/macros defined by core/ngx_rbtree.h:

void ngx_rbtree_init(ngx_rbtree_t* tree, ngx_rbtree_node_t* s,
                     ngx_rbtree_insert_pt ins_func);
void ngx_rbtree_insert(ngx_rbtree_t *tree, ngx_rbtree_node_t *node);
void ngx_rbtree_delete(ngx_rbtree_t *tree, ngx_rbtree_node_t *node);

Hold on, something must be missing. There is no search/find/whatever you want to call it function that will retrieve the node I'm looking for. I thought red-black trees were binary search trees... ;) Well, they are! But Igor doesn't know what we are searching, so he can't help. It's also why we have to provide our own insertion function: this function has to figure out where to insert the node in the tree. But, hold on, there is a ngx_rbtree_insert. At this point (two levels of "hold on"), pretty much everybody should be lost and we should use the ultimate clarification tool: bullet-points. ;)

  • About finding:
    • nginx doesn't know what you are storing in the tree.
    • Therefore, it cannot figure out by itself if node A is before node B (or after).
    • This could be solved by using a "comparator" function. For whatever reason, Igor did not go this way. Instead he left this task to the user of the red-black tree. This is usually a simple loop. So, most people trying to use a red-black tree should be able to write it.
  • About inserting:
    • The first step of an insertion is a red-black tree is to find where to insert the node.
    • To do so, you need to know how to compare two nodes.
    • For the reasons mentioned above (in "About finding"), you (developer using ngx_rbtree) have to provide this infamous insertion function.
    • The second step is rebalancing the tree. This is a fairly complex algorithm (the one that makes use of the color of a node).
    • Therefore, it makes perfect sense to let Igor implement this algorithm (with some help from "Introduction to Algorithms" by Cormen, Leiserson and Rivest - or so he claims in the comments).
    • As a consequence, we end up with a ngx_rbtree_insert function provided by nginx that delegates the work in step 1 to ngx_rbtree_t.insert and does the heavy lifting of step 2.

Now, since I promised an example, let's look at how core/ngx_resolver.c uses this red-black tree. It all starts with the ngx_resolver_node_t structure:

typedef struct {
    ngx_rbtree_node_t         node;
    ngx_queue_t               queue;

    /* PTR: resolved name, A: name to resolve */
    u_char                   *name;
    [...]
}

Same trick as with ngx_queue_t: do some object inheritance by "glueing" your data after the structure you inherit from. And this also explains why the last element of the ngx_rbtree is named data (and why it has such a meaningless type): it is just a marker to show where the actual data (as opposed to the overhead due to the red-black tree) starts.

Function ngx_resolver_lookup_name does the lookup to retrieve the "resolver context" (a structure holding all the info needed to perform and track name resolution) based on the name to resolve. An interesting point is that it takes as argument a hash of the name on 32 bits (you'll see why very soon). And the code is so simple, I can paste it here:

    while (node != sentinel) {
        if (hash < node->key) {
            node = node->left;
            continue;
        }
        if (hash > node->key) {
            node = node->right;
            continue;
        }

        /* hash == node->key */
        do {
            rn = (ngx_resolver_node_t *) node;
            rc = ngx_memn2cmp(name->data, rn->name, name->len, rn->nlen);
            if (rc == 0) {
                return rn;
            }

            node = (rc < 0) ? node->left : node->right;
        } while (node != sentinel && hash == node->key);
        break;
    }

Yes, this red-black tree structure uses a hash of the string to index its elements and, once this is not enough (after all it's just a hash), it uses the string comparison. The main benefit is obviously speed as you don't have to perform a string comparison for each node you traverse.

And last, the function ngx_resolver_rbtree_insert_value is provided to insert the node and its code is very similar to the one from ngx_resolver_lookup_name because it first searches where to insert the node. Then, all it does is to attach it and paint it red (something mandatory for the balancing algorithm to work).

One last thing: I lied. ;) There are a couple of insert functions provided by nginx for "standard" nodes:

  • ngx_rbtree_insert_value if what you are indexing is an unsigned int and you put it in key.
  • ngx_rbtree_insert_timer_value for timers.
  • ngx_str_rbtree_insert_value and ngx_str_rbtree_lookup (look in core/ngx_string.cif you index strings and use key to store a 32-bit hash of your string. If you look at the code you'll also notice that the string should be stored starting at data. Does that ring a bell?

There might be others but this should be enough for most uses and if it's not, that gives you plenty of examples to work with.

Friday, April 22, 2011

Day 47 - locations, ngx_queue_t and ngx_http_location_tree_t

As a follow-up to my previous post on debugging locations, I went even further in the location internals and that's what I'm about to share with you. But before we get ourselves carried away, I would like to apologize for an unforgivable mistake in my previous post: I called the structure used to retrieve static locations a "red-black (or binary) tree".

Shame on me. For this, I probably deserve to have to run Windows Vista for at least a week... ;)

No worries, I'm not going to do it.

After spending quite some time on wikipedia, it appears that I was quite mistaken on binary trees, red-black trees and the like. I'll just take a short moment to tell you what I now understand to be the difference (and hopefully be less verbose that wikipedia):

  • A binary tree is simply a tree in which each node can have up to 2 children.
  • A binary search tree is a binary tree which guarantees that all nodes on the left of a given node (let's call it N) are before N and that all nodes on the right of N are after it. This kind of structure is great for searching things fast. Or at least it's great on "average" (it's in O(log(n)), n being the number of nodes in the tree). Of course, if you are out of luck and the tree is "unbalanced" and you are looking for something at the very end of the tree), then it is as bad as looking at each node individually. That's called being in O(n) and it's pretty bad.
  • So, a bunch of smart people invented "self-balancing binary search trees" (there are tons of variations on this theme). And the most famous of these structures is known as red-black tree. I always thought that red and black were just different names for the left and right edges that come out of a node. I was wrong. Red and black refer to colors that you give to nodes. The color of a node comes very handy when you need to insert (or delete) a node in the binary tree still keeping the overall tree balanced. The nice thing about red-black trees is that they do such a good job at keeping themselves balanced that they are fast (i.e. O(log(n)) ) even in the worst case scenario.

That should clear things up and if you want to dig more, there are an impressive number of binary tree structures documented on wikipedia. Now, what nginx uses to store locations (see how to debug location) is not a red-black tree. It's some kind of a mix between a radix tree (which is a kind of trie) and a binary search tree. I still might be slightly wrong here, so if you have a better name for what ngx_http_location_tree_t is, just comment here. I must say that after finding out how many different trees and data structures there are behind something I always considered simple (binary search trees), I decided to let it go and thought again about reading Donald Knuth's "The Art of Computer Programming" (still haven't purchased it though as I'm a bit afraid to discover how little I know after all those years).

Now that apologies and theory are behind us, let's have a closer look at the process that builds this tree. Nginx is an engine. Locations processing being specific to HTTP, it is handled by the corresponding module (namely the HttpCoreModule). This module handles (among other things) the processing of the http and location blocks. Here is how this goes:

  • http block is parsed. It triggers parsing of server and location blocks (which must be inside http to respect the configuration file syntax).
  • As a result, the "main" configuration (structure ngx_http_core_main_conf_t) is filled with an array (field servers) of server configurations (structure ngx_http_core_srv_conf_t), each of them containing a ngx_queue_t of locations.
  • The ngx_http_init_locations function will sort those locations and split them into three different queues: the "named" locations, the "regexp" locations and the "static" locations. Processing for the "named" and "regexp" locations pretty much stops here (they are then used in their order of appearance in the configuration).
  • "static" locations must be transformed into the final search tree. Function ngx_http_init_static_location_trees does exactly that and the algorithm goes:
    • Join "exact" and "starts with" locations if they have the same "name" (e.g location /foo {...} and location = /foo {...}). In the final tree structure they both will be attached to the same node (the one containing "/foo" in our example). This is function ngx_http_join_exact_locations
    • Group locations families (function ngx_http_create_locations_list). For example, if location 1 is "/foo" and location 2 is "/foo/bar", then the second location is attached to the first one (through field list of structure ngx_http_location_queue_t). Note that this function won't group "/fuu/bar" and "/fuu/baz" because "/fuu/baz" is not a child of "/fuu/bar".
    • Build the "search" tree (function ngx_http_create_locations_tree):
      • Split the list in two.
      • Assign the location in the middle to the root node of the tree. If this location has "children" locations, recursively process them (through ngx_http_create_locations_tree) and store the result as subtree of the node.
      • Recursively process the first sub-list and assign it to the left edge of the node.
      • Recursively process the second sub-list and assign it to the right edge of the node.

In this code Igor makes extensive use of the ngx_queue_t structure:

typedef struct ngx_queue_s  ngx_queue_t;
struct ngx_queue_s {
    ngx_queue_t  *prev;
    ngx_queue_t  *next;
};

It is defined in ngx_queue.h and comes with a lot of macros (yes: macros, not functions - for performance reasons, of course) for its manipulation. The names are good enough so comments are not needed. Still I added "type" information that you would likely have if they were functions (don't be surprised by the difference if you look at the code):

void ngx_queue_init(ngx_queue_t* q);
bool ngx_queue_empty(ngx_queue_t* q);
void ngx_queue_insert_head(ngx_queue_t* q, ngx_queue_t* new_item);
void ngx_queue_insert_after(ngx_queue_t* after_item, ngx_queue_t* new_item);
void ngx_queue_insert_tail(ngx_queue_t* q, ngx_queue_t* new_item);
ngx_queue_t* ngx_queue_head(ngx_queue_t* q);
ngx_queue_t* ngx_queue_last(h);
ngx_queue_t* ngx_queue_sentinel(h);
ngx_queue_t* ngx_queue_next(q);
ngx_queue_t* ngx_queue_prev(q);
void ngx_queue_remove(ngx_queue_t* item_to_remove);
void ngx_queue_split(ngx_queue_t* q, ngx_queue_t* split_item, ngx_queue_t* tail_queue);
void ngx_queue_add(ngx_queue_t* queue1, ngx_queue_t* queue2);

Now, I see the java guys barking: "This is not strongly typed, this russian kiddo is mistaking a queue for its item". Look at the java version of a queue and you will see that a queue is not mistaken for its items... ;) Well, C is C. What did you expect? Arguably, one could do a better job using a few typedefs but it's beyond the point. The thing is that using a double-linked list to implement a queue is a very good way to have excellent performance on insertion, deletion, splitting and sorting (moving an item around is just a few pointers assignments). And to top things, you can even do this for any kind of object as items. If you are smart enough to build the items as a structure that starts with a ngx_queue_t structure. Just like:

typedef struct {
    ngx_queue_t                      queue;
    ngx_http_core_loc_conf_t        *exact;
    [...]
} ngx_http_location_queue_t;

And now, you can even call the function ngx_queue_sort that takes as arguments the queue to sort and a comparator function that knows how to compare two items of this queue. Yes, javaboys are going to argue that this is all a poor man's way of implementing inheritance and/or templates. And code is less readable. Yes, it's all true. But the result is very fast and uses very little memory. Anyway, I thought that might be interesting to explain as this is exactly the way the sorting in ngx_http_init_locations is implemented (the comparator is ngx_http_cmp_locations and it sorts according to a very particular order: the one needed by the rest of the algorithm for the grouping described above).

The ngx_queue_t structure being really a "circle", it uses a sentinel node which acts as the marker of the beginning and end of the queue but also as the queue itself. In all the macros/functions above you are supposed to hand over the sentinel as q parameter. If you don't, you might get unexpected results. So, in the end the queue is not mistaken for its items. It's just not very strongly typed...

One last note on the static locations tree. I said at the beginning that it is part-radix-tree. The reason why it is not a true radix tree is that if it were, in the example mentionned earlier the presence of locations "/fuu/bar" and "/fuu/baz" would create a tree like that:

/fuu/ba--r
  |
  z

And such a tree where some node might not be locations would not be so efficient during request processing because you don't know how far you would have to backtrack in order to get a "containing" location. With the current implementation you know you only have to go one level up.

OK, I have to stop with these super-long posts. But, hey we are really diving into the core of the thing and waters are not always as shallow as it looks.

Tuesday, April 19, 2011

Day 46 - how to debug location in nginx

Last time I told you about the different kinds of locations in nginx. You've been a good student, understood everything and started putting together the perfect configuration for your perfect web site. Now you have this huge configuration file and you don't get why this damn location is not being hit. The coffee machine ran out of coffee 3 hours ago and, unlike me, you don't have a fig tree to sit under and think about it (by the way, you should really consider buying one... ;)). Instead, take a deep breath and read this post.

If you ask for help on the mailing list, the response you'll get is in two steps:

  1. Show us your full config (you cannot imagine the number of people who send just a snippet and at the end realize the problem is "somewhere else").
  2. Look at the debug logs.

And we are going to do exactly that, with a bit of commenting... For those of you who want to experience it (and not just read lazily from their couch), here is how you turn on debugging log. And here is the configuration we are going to play with (server part is left as an exercise to the reader or just copy/paste this into the default config, it should work):

    location /foo {
        echo foo;
    }
    location /fuu {
        echo fuu;
    }
    location /fuu/bar {
        echo fuu-bar;
    }
    location = /foo {
        echo foo-eq;
    }
    location ~ /foo/bar {
        echo foo-bar-regexp;
    }
    location @/foo/bari {
        echo foo-bar-internal;
    }
    location ^~ /foo/bar {
        echo foo-bar-starts-before-regexp;
    }

Now, let's see what happens with GET /foo/bari. The logs look something like:

18:58:48 [debug] 12569#0: *1 http request line: "GET /foo/bari HTTP/1.1"
18:58:48 [debug] 12569#0: *1 http uri: "/foo/bari"
18:58:48 [debug] 12569#0: *1 http args: ""
[...]
18:58:48 [debug] 12569#0: *1 test location: "/fuu"
18:58:48 [debug] 12569#0: *1 test location: "/foo"
18:58:48 [debug] 12569#0: *1 test location: "/bar"
18:58:48 [debug] 12569#0: *1 using configuration "/foo/bar"

There is already a whole lot of information here so let's have a closer look. The most obvious is of course using configuration "/foo/bar". nginx tells you which configuration it is using (/foo/bar). First thing to notice is that it does not tell you which type of location it is using. This is kind of annoying actually (yes, I'm a big nginx fan but positive criticism is allowed, even for a russian web server ;)). In our example we have two "/foo/bar" locations. Which one is nginx referring to ? Well, here it's fairly easy to tell: the one it did test. ;)

That's where the test location lines become handy. Let's have a look at them. nginx started testing the URI against the "/fuu" location and figured it did not match. So, it went for the "/foo" location which matches. But it did not stop and went for the "/bar" location. Hold on, there is no "/bar" location ! You see, you should buy a fig tree... ;) More seriously (although this has to do with trees), what you are seeing here is actually nginx walking down the red-black (or binary) tree where it stores the locations. So, here is what our configuration translates into, tree-wise:

 [/fuu](s)--*
   |  \\    
   |   \\
   |   [/bar](s)--*
   |     |  \\
   |     |   \\
   |     *    *
   |
 [/foo](s,e)---*
   |  \\     
   |   \\
   *   [/bar](s)---*
         |  \\
         |   \\
         *    *

First, I would like to take a moment to admire the beauty of this. It deserves to enter the hall of fame of ascii-art (at least ;)). Trust me, it looks better than the (open|libre)office presentation. But it still deserves some kind of comments/legend:

  • The [] denotes the "test" for a node in the tree. If you're much into reading code, this is the name field of the ngx_http_location_tree_node_s structure. Based on the result of comparing the URI with this name, you will end-up going left, going right or staying at this node in the tree (may be going down the sub-tree).
  • The () denotes that this node is also a location. It can be of two different types:
    • s for "starts with", also known as inclusive (which is a pointer to the location configuration structure) in the source code.
    • e for "exact", also known as exact (a pointer to the location configuration structure) in the source code
  • - and | denote respectively the left and right branches that grow out of the node. Basically, if what you are testing is "before" the content of [], you go right (or |). If it is "after", you go left (or -).
  • \\ denotes the path (also known as the tree in the source code) you should take if you matched exactly and still have some URI left-over.
  • * denotes an empty node. Think of it as /dev/null, or better: don't think at all. ;)

Now, first thing of interest is that there are only 4 nodes and 5 locations here (/foo is both "starts with" and "exact"). Now, this makes perfect sense and confirms that this red-black tree is used only for static locations (no "regexp locations" and no "named locations"). Also note that there is no special mark for "starts with before regexp" locations. It also means that there is no way to have a location with the same value be both "starts with" and "starts with before regexp" (if you don't trust me, try it for yourself).

Now let's get back to looking at our logs and at the walk down the tree for GET /foo/bari:

  1. nginx took right at [/fuu]
  2. Then it tried [/foo] which happened to be a match.
  3. So, it went straight (neither left nor right). Unfortunately there is no indicator of this in the logs, so you have to look at the test and the URI to figure it out. :(
  4. First node in the "sub-tree" is [/bar] which is tested (logs say so). And it matches.
  5. So, nginx tries to go straight but the "sub-tree" is empty and it stops here (hence the "using configuration /foo/bar").

Easy, no? ;) So, let's have a look at other requests for the fun of it. GET /foo gives the following logs:

18:58:48 [debug] 12569#0: *2 test location: "/fuu"
18:58:48 [debug] 12569#0: *2 test location: "/foo"
18:58:48 [debug] 12569#0: *2 using configuration "=/foo"
  1. Test "/fuu". It doesn't work, so take a right.
  2. Test "/foo". It matches.
  3. There is nothing left in the URI and there is an "exact" location (denoted e) at this node. So, it's a final match.

Notice how nginx is a good boy and tells us it found an exact location (the = before /foo).

GET /foo/ba is another interesting case:

18:58:48 [debug] 12569#0: *3 test location: "/fuu"
18:58:48 [debug] 12569#0: *3 test location: "/foo"
18:58:48 [debug] 12569#0: *3 test location: "/bar"
18:58:48 [debug] 12569#0: *3 test location: ~ "/foo/bar"
18:58:48 [debug] 12569#0: *3 using configuration "/foo"
  1. Test "/fuu" and take a right as usual (by now you should know the way ;)).
  2. Test "/foo" and go straight.
  3. Test "/bar" which does not match. So, stop here.
  4. Give a shot at the "regexp" location (which fails).

First regexp location test. Now, the real question is "why haven't we seen it before?". Well, GET /foo was an exact location match. In such a situation there is no point in trying regexp. GET /foo/bari was a "starts with before regexp" location, so once it is found to be a valid location for the request there is no point in trying the "regexp" locations. You've noticed how the "before regexp" part doesn't show in the tree above. This truly reflects the code: it is a property of the location configuration structure (namely the noregex field) that regexp should not even be tried if the location matches. Now that you know why we haven't seen it before you also know why we are seeing it now: the "/bar" test did not work, so the matched location after walking the tree is the "/foo" location which is a simple "starts with" (noregex = 0). Therefore, nginx tries the regexp locations. A useless attempt here, so the configuration actually used is "/foo", just like the logs tell us.

To try the regexp and see what it does, let's use GET /not/foo/bar:

18:58:48 [debug] 12569#0: *4 test location: "/fuu"
18:58:48 [debug] 12569#0: *4 test location: ~ "/foo/bar"
18:58:48 [debug] 12569#0: *4 using configuration "/foo/bar"
  1. Test "/fuu" and for once, let's take a left.
  2. Ooops, it's a dead-end. So we stop walking the static locations tree. And we have not found a matching location. :(
  3. nginx gives a shot at the "regexp" location which, lucky us, matches.

Bottom line, the logs tell us which configuration is used: "/foo/bar". Too bad there is no way to figure out this is the "regexp" location and not the "starts with" one... :( Yes, it is very unlikely for regular expression to look like a "normal" URI. But one is enough to have you spend hours trying to figure out what happens.

Hope you enjoyed this and for those of you who care, I have a little exercise: what is the algorithmic complexity of finding the location that applies to a request? Feel free to post your responses so we can start arguing who's right ans who's wrong. ;)

Sunday, April 17, 2011

Day 45 - location regexp or no regexp

Sometimes when you are wandering in a foreign land, something at the limit of your vision that is always there... You never look at it straight because... Well, because it's not really something you're interested in. But after some time you always end up looking at it carefully. I guess this is curiosity or just a side effect of repetition. So, I have seen so many location here and there that I decided to have a closer look at them.

The location documentation is really good and probably answers all the questions you might have trying to get your configuration working. And you should stop reading this post about right now. Still there? Good, I was kidding... Let's have a more pragmatic approach than the official documentation and let's have a look at examples. I would love to call them "real-life" examples but they are not. They are more "teaching" examples: built specifically to show something.

A location is really the basic block of a nginx configuration and it is very different from what you are used to with Apache (for example :)). So, this gets everybody confused at the beginning but once you get used to it, you'll come to like (or even love) it. A location block is where you define how to handle requests that match this location. Very high-level you could define nginx request processing as:

  1. Parse the request.
  2. Find the matching location.
  3. Process the request based on what is in this location.

There are different kinds of locations. We are going to start with the basic ones: the "exact" location and the "starts with" location (yes, I'm using the Echo module, it's just so convenient for this kind of things):

    location /foo {
        echo foo;
    }
    location = /foo/bar {
        echo foo-bar-eq;
    }
    location /foo/bar {
        echo foo-bar;
    }

The location with the = is the "exact" location and it's the easiest one to explain: if the request is on the URI that is after the = sign, the content of the location block is used to process the request. So, in the example above, GET /foo/bar returns foo-bar-eq (with a \n appended because it's the way of the echo module - or the unix command for that matter).

If, like me you are stupid enough to try it, having the same exact location twice does not work and will prevent nginx to start with:

[emerg]: duplicate location "/foo/bar" in .../nginx.conf

The feature to consider duplicate locations an error works also for the other types of locations (provided location type and URI are the same). And it's a pretty damn good feature because this way you won't get errors due to an extra copy-paste.

The other type of location in the example above is the "starts with" location. If the request starts with the URI defined, then this location will be used. In our example, that's why GET /foo/bari returns foo-bar, just like GET /foo/bar/ but unlike GET /foo/bor which returns foo. You saw me coming: there is some prioritization going on here. nginx uses the longest "starts with" location that matches the request. Not the first one in the configuration order. That's why our config returns foo-bar and not foo when requesting GET /foo/bari

Now, let's add some hot sauce to this with "regular expression" locations. Yes, of course nginx support them. If not, how would you forward all the ".php" requests to your fastcgi backend for processing. So, let's have a look at this configuration:

    location ~ ^/foo {
        echo foo;
    }
    location ~ ^/foo/bar {
        echo foo-bar;
    }

~ is the operator identifying that the following string should be considered a "regexp" (there is also a ~* operator that makes the regular expression match case insensitive). If you know about regular expressions and look attentively at the regular expressions, you cannot fail to see the similarity with the previous example (reminder for those of you who skipped the regular expressions course: ^ is the regular expression equivalent of "starts with"). And this is where things get messy: GET /foo returns foo but GET /foo/bari returns foo. Yeah, baby! If you want the same behavior as before, well you just have to write your configuration this way:

    location ~ ^/foo/bar {
        echo foo-bar;
    }
    location ~ ^/foo {
        echo foo;
    }

So, with regular expressions the order of appearance in your config file is important. Yes, this is bad and you should be careful with it (especially when your configuration grows and the regular expressions don't hold on one screen anymore...). Putting most specific regular expressions first is probably the most natural way (please note that sometimes there is no way to figure out which expression is the most specific, as in ~ /foo/ versus ~ /bar/: both match /a/foo/bar/b and /a/bar/foo/b...).

Now, if you think this is the end of it, think again and look at this:

    location /foo/bar {
        echo foo-bar-starts-with;
    }
    location ~ ^/foo/bar {
        echo foo-bar-regexp;
    }

What is the response to GET /foo/bari? Les jeux sont faits, rien ne va plus... And the winner is foo-bar-regexp. Yes, "regexp" locations take over "starts with" locations (regardless of the order of appearance, as you noticed from the example).

Now is a good time to introduce the "start with before regexp" location (operator is ^~):

    location ~ ^/foo/bar {
        echo foo-bar-regexp;
    }
    location ^~ /foo/bar {
        echo foo-bar-starts-before-regexp;
    }

As you would expect, GET /foo/bari returns foo-bar-before-regexp. The operator is kind of misleading because it's very similar to the regexp one. To remember I use the following trick : ~ is for regexp and ^ denotes the beginning and the negation in regexp world. So ^~ "almost naturally" conveys the meanings that it comes before regexp and that it is not a regexp. YMMV, of course... ;)

The last type of location is very different from the rest of the gang because it is not used during "the search for a location" on a request. It is a "named" location and will be available only as the target of a rewrite (or something similar). Here is an example:

    location @/foo/bari {
        echo foo-bar-internal;
    }
    location ^~ /foo/bar {
        echo foo-bar-starts-before-regexp;
    }

Just like before a GET /foo/bari returns foo-bar-starts-before-regexp. Now, we just proved that named locations are not used for external requests processing. I'm not going to give you an example of using a named location in a rewrite rule because this is far beyond the subject of this post. Now, please note that the @ operator is not followed by a space (unlike the others location operators). Actually, it is pretty much standard to use something that "does not look like a URI" as the name of a named location. Something like location @fallback.

So, with all this you see the order of precedence for locations:

  1. "Exact" location (= operator).
  2. "Starts with before regexp" location (^~ operator). Inside this, longest strings comes first.
  3. "Regular expression" location (~ or ~* operator). Inside this, order of precedence is order of appearance in configuration file.
  4. "Starts with" location (no operator). Inside this, longest strings come first.

You see there is quite room for shooting oneself in the foot. That is probably why you will see most of Igor's configurations (in replies to the mailing-list) avoid using regular expressions. I think this has to do with the difficulty there is to figure out which location is used when reading the configuration but also with the fact that regexp trial-and-error method to find the matching location is slow compared to the method used for literal strings (a red-black tree). So, the recommanded way seems to be to use "nested locations" like this:

location /thumb- {
   location ~ ^/thumb-(.*)\.jpg$ {
       [...]
   }
}

With this, not only do you have a cleaner configuration (it's easier to figure out what happens to /thumb-... requests) but you also save CPU cycles by not even trying a regexp match (that will fail) on all requests that don't start with /thumb-. Oh, the wiki page says nested locatiosn are not recommended, but if the creator himself uses them, why shouldn't we? ;)

That should give you a little bit to ponder till my next post.

Tuesday, April 12, 2011

Day 44 - nginx 1.0.0 : half a century after Gagarin in space

I have been awfully silent lately but I could not let Igor launch nginx 1.0.0 without posting about it.

The master himself made the reference to the launch of Vostok 1 (which I heard of this morning). I just love the way this guy is thinking.

Poyekhali!

For those of you who were wondering, there seems to be not much new per se in the 1.0.0 version. All the companies I worked for I told marketing that they should "own" the left-most number of versions, engineering should "own" the right-most and the middle one is open for war... ;) Looks like Igor somehow agrees with me on this.

One more thing: now it's available through svn: svn://svn.nginx.org. And for the git fans (hello Freddy) or svn allergics (hello Linus), there is always git-svn